Abstract

One of the most attractive recent approaches to processing well-structured large-scale
convex optimization problems is based on a smooth convex-concave saddle point reformulation
of the problem of interest and solving the resulting problem by a fast First Order
saddle point method utilizing smoothness of the saddle point cost function. In this paper,
we demonstrate that when the saddle point cost function is polynomial, the precise gradients
of the cost function required by deterministic First Order saddle point algorithms,
which become prohibitively expensive computationally in the extremely large-scale case,
can be replaced with incomparably cheaper unbiased random estimates
of the gradients. We show that for large-scale problems with favourable geometry, this
randomization accelerates the solution process, progressively as the sizes of the problem grow.
This significantly extends previous results on acceleration by randomization,
which, to the best of our knowledge, dealt solely with bilinear saddle point problems. We
illustrate our theoretical findings by instructive and encouraging numerical experiments.
1 Introduction



The goal of this paper is to develop randomized First Order algorithms for solving large-scale
"well structured" convex-concave saddle point problems. The background and motivation for
our work can be briefly outlined as follows. Theoretically, all of Convex Programming is
within the grasp of polynomial time Interior Point Methods (IPM's) capable of generating
high-accuracy solutions at a low iteration count. However, the complexity of an IPM iteration,
in general, grows rapidly (as $n^3$) with the design dimension $n$ of the problem, which in
numerous applications (like LP's with dense constraint matrices arising in Signal Processing)
makes IPM's prohibitively time-consuming in the large-scale case. There seemingly is consensus
that "beyond the practical grasp of IPM's," one should use the First Order Methods (FOM's)
which, under favorable circumstances, allow one to get medium-accuracy solutions in a (nearly)
dimension-independent number of relatively cheap iterations. Over the last decade, there has
been significant progress in FOM's; to the best of our understanding, the key to this progress
is in discovering a way (Nesterov 2003, see [11]) to utilize the problem's structure in order
to accelerate FOM algorithms, specifically, to reduce a convex minimization problem
$\min_{x\in X} f(x)$ with potentially nonsmooth objective $f$ to a saddle point problem

$$\min_{x\in X}\max_{y\in Y}\phi(x,y), \tag{SP}$$

where $\phi$ is a $C^{1,1}$ convex-concave function such that

$$f(x)=\max_{y\in Y}\phi(x,y). \tag{1}$$

The rationale is as follows: when $f$ is nonsmooth (which indeed is the case in typical
applications), the (unimprovable in the large-scale case) rate of convergence of FOM's directly
applied to the problem of interest $\min_{x\in X} f(x)$ is as low as $O(1/\sqrt{t})$, so that
finding a feasible $\epsilon$-optimal solution takes as much as $O(1/\epsilon^2)$ iterations.
Utilizing representation (1), this rate can be improved to $O(1/t)$; when $X$, $Y$ are simple,
this dramatic acceleration keeps the iteration's complexity basically intact.

Now, in Nesterov's original Smoothing [11], (1) is used to approximate $f$ by a $C^{1,1}$
function which is further minimized by Nesterov's optimal algorithm for smooth convex
minimization. An alternative is to work on (SP) "as it is," by applying to (SP) an
$O(1/t)$-converging saddle point FOM, like the Mirror Prox algorithm [8]; in what follows,
we further develop this alternative.

When solving (SP) by a FOM, the computational effort per iteration has two components:
(a) computing the values of $\nabla\phi$ at $O(1)$ points from $Z = X\times Y$, and
(b) "computational overhead," like projecting onto $Z$. Depending on the problem's structure
and sizes, either one of these two components can become dominating; the approach we are
developing in this paper is aimed at the situation where the computational "expenses" related
to (a) by far dominate those related to (b), so that the "practical grasp" of the usual
(deterministic) saddle point FOMs as applied to (SP) is restricted to the problems where the
required number of computations of $\nabla\phi$ (which usually is in the range of hundreds)
can be carried out in a reasonable time. An attractive way to lift, to some extent, these
restrictions is to pass from the precise values of $\nabla\phi$, which can be prohibitively
costly computationally in the large-scale case, to computationally cheap unbiased random
estimates of these values. This idea (in retrospect, originating from the ad hoc sublinear
type matrix game algorithm of Grigoriadis and Khachiyan [3]) has been developed in several
papers, see [1, 9, 4, 2, 7], [6, section 6.5.2] and references therein. To the best of our
knowledge, for the time being "acceleration via randomization" was developed solely for the
case of saddle point problems with bilinear cost function $\phi$. The contribution of
this paper is in extending the scope of randomization to the case when $\phi$ is a polynomial.

The main body of this paper is organized as follows. In section 2, we formulate the problem
of interest and present the necessary background on our "workhorse," the Mirror Prox
algorithm. In section 3, we develop a general randomization scheme aimed at producing unbiased
random estimates of $\nabla\phi$ for a polynomial $\phi$. Theoretical efficiency estimates for
the resulting randomized saddle point algorithm are derived in section 4. In section 5, we
illustrate our approach by working out in full detail two generic examples: optimizing the
maximal eigenvalue of a quadratic matrix pencil, and low dimensional approximation of a finite
collection of points. We show theoretically (and illustrate by numerical examples) that in both
these cases, in a meaningful range of problem sizes and $\epsilon$, solving the problem within
accuracy $\epsilon$ by a randomized algorithm is by far less demanding computationally than
achieving the same goal with the best deterministic competitors known to us, and the resulting
"acceleration by randomization" goes to $\infty$ as the problem sizes grow.



2 Situation and Goals 



2.1 Problem Statement 



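Consider the situation as follows: let $X\subset E_x$, $Y\subset E_y$ be convex compact subsets
of Euclidean spaces, and let $\phi(x,y): E := E_x\times E_y\to\mathbf{R}$ be a polynomial of
degree $d$:

$$\phi(z)=\sum_{k=0}^{d} Q_k(\underbrace{z,...,z}_{k}), \tag{2}$$

where $Q_0$ is a constant, and for $k>0$, $Q_k(z^1,...,z^k)$ is a $k$-linear symmetric form on
$E$. From now on we assume that $\phi(x,y)$ is convex-concave on $X\times Y$, that is, convex
in $x\in X$ for fixed $y\in Y$, and concave in $y\in Y$ for fixed $x\in X$. The problem of
interest is the saddle point problem

$$\mathrm{SadVal}=\min_{x\in X}\max_{y\in Y}\phi(x,y). \tag{3}$$

Let

$$\begin{array}{rcll}
\mathrm{Opt}(P)&=&\min_{x\in X}\left[\overline{\phi}(x):=\max_{y\in Y}\phi(x,y)\right]&\quad(P)\\
\mathrm{Opt}(D)&=&\max_{y\in Y}\left[\underline{\phi}(y):=\min_{x\in X}\phi(x,y)\right]&\quad(D)
\end{array}$$

be the primal-dual pair of convex programs associated with (3), so that
$\mathrm{Opt}(P)=\mathrm{Opt}(D)$, let

$$\mathrm{DualityGap}(x,y)=[\overline{\phi}(x)-\mathrm{Opt}(P)]+[\mathrm{Opt}(D)-\underline{\phi}(y)]=\overline{\phi}(x)-\underline{\phi}(y) \tag{5}$$

be the associated duality gap, and, finally, let

$$F(x,y)=\left[F_x(x,y):=\frac{\partial\phi}{\partial x}(x,y);\ F_y(x,y):=-\frac{\partial\phi}{\partial y}(x,y)\right]: Z:=X\times Y\to E=E_x\times E_y \tag{6}$$

be the monotone mapping associated with (3). Our ideal goal is, given a tolerance $\epsilon>0$,
to find an $\epsilon$-solution to (3), i.e., a point $z_\epsilon=(x_\epsilon,y_\epsilon)\in Z$
such that

$$\mathrm{DualityGap}(z_\epsilon)\le\epsilon, \tag{7}$$

whence $x_\epsilon$ is a feasible $\epsilon$-optimal solution to (P), while $y_\epsilon$ is a
feasible $\epsilon$-optimal solution to (D). We intend to achieve this goal by utilizing a
randomized First Order saddle point algorithm, specifically, the Stochastic Mirror Prox method
(SMP) [4].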
2.2 Background on Stochastic Mirror Prox algorithm

The setup for SMP as applied to (3) is given by

• a norm $\|\cdot\|$ on the subspace

$$L[Z] := \mathrm{Lin}(Z-Z)$$

in the embedding space $E := E_x\times E_y$ of the domain $Z := X\times Y$ of the saddle point
problem. The (semi)norm on $E$ conjugate to $\|\cdot\|$ is denoted by $\|\cdot\|_*$:

$$\|\zeta\|_*=\max_{z\in L[Z]}\{\langle\zeta,z\rangle:\|z\|\le 1\};$$

• a distance-generating function (d.-g.f.) $\omega(z): Z\to\mathbf{R}$ which should be convex
and continuously differentiable on $Z$, should admit a continuous on
$Z^o := \{z\in Z:\partial\omega(z)\ne\emptyset\}$ selection $\omega'(z)$ of subgradients, and
should be compatible with $\|\cdot\|$, meaning strong convexity, modulus 1, w.r.t.
$\|\cdot\|$:

$$\langle\omega'(z)-\omega'(z'),z-z'\rangle\ge\|z-z'\|^2\quad\forall(z,z'\in Z^o).$$

An SMP setup induces several important entities, specifically

• the $\omega$-center $z_\omega := \mathrm{argmin}_{z\in Z}\,\omega(z)$ of $Z$;

• the Bregman distance $V_z(w) := \omega(w)-\omega(z)-\langle\omega'(z),w-z\rangle$, where
$z\in Z^o$ and $w\in Z$. By strong convexity of $\omega$, we have
$V_z(w)\ge\frac{1}{2}\|w-z\|^2$;

• the $\omega$-radius $\Omega := \sqrt{2[\max_Z\omega(\cdot)-\min_Z\omega(\cdot)]}$; noting
that $\frac{1}{2}\|w-z_\omega\|^2\le V_{z_\omega}(w)\le\omega(w)-\omega(z_\omega)$, we
conclude that

$$\forall(w\in Z):\ \|w-z_\omega\|\le\Omega; \tag{8}$$

• the prox-mapping $\mathrm{Prox}_z(\xi)$, $z\in Z^o$, $\xi\in E$, defined as

$$\mathrm{Prox}_z(\xi)=\mathrm{argmin}_{w\in Z}\left[\langle\xi,w\rangle+V_z(w)\right]=\mathrm{argmin}_{w\in Z}\left[\langle\xi-\omega'(z),w\rangle+\omega(w)\right].$$
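To make the prox-mapping concrete, here is a minimal sketch for one standard setup that is not taken from the text above and is assumed purely for illustration: $Z$ is the probability simplex, $\|\cdot\|=\|\cdot\|_1$, and $\omega(z)=\sum_i z_i\ln z_i$ is the entropy. In this case $V_z(w)$ is the Kullback-Leibler divergence and $\mathrm{Prox}_z(\xi)$ has the closed form $w_i\propto z_i e^{-\xi_i}$. The function names are hypothetical.

```python
import math

def entropy_prox(z, xi):
    """Prox_z(xi) on the probability simplex for omega(z) = sum_i z_i ln z_i.

    Minimizing <xi, w> + V_z(w) over the simplex gives w_i proportional to
    z_i * exp(-xi_i); z is assumed to have strictly positive entries.
    """
    shift = min(xi)  # adding a constant to xi does not change the minimizer
    weights = [zi * math.exp(-(x - shift)) for zi, x in zip(z, xi)]
    total = sum(weights)
    return [w / total for w in weights]

def bregman(z, w):
    """Bregman distance V_z(w) of the entropy, i.e. the KL divergence of w from z."""
    return sum(wi * math.log(wi / zi) for wi, zi in zip(w, z) if wi > 0)

z = [0.25, 0.25, 0.5]
xi = [1.0, 0.0, -1.0]
w = entropy_prox(z, xi)  # mass moves toward the coordinates where xi is small
```

The closed form is what makes the "computational overhead" of an iteration cheap in this particular geometry; for other domains the prox-mapping has to be worked out separately.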

As applied to (3), SMP operates with a Stochastic Oracle representation of the vector field
$F$ associated with the problem. A Stochastic Oracle is a procedure ("black box") which, at the
$t$-th call, a point $z_t$ being the input, returns the random vector

$$g(z_t,\xi_t)=F(z_t)+\Delta(z_t,\xi_t)\in E,$$

where $\Delta(\cdot,\cdot)$ is a deterministic function, and $\xi_1,\xi_2,...$ is a sequence
of i.i.d. "oracle noises." The SMP algorithm is the recurrence

$$\begin{array}{ll}
\hbox{initialization:}&z_1=z_\omega;\\
\hbox{search points:}&z_t\mapsto w_t=\mathrm{Prox}_{z_t}(\gamma_t g(z_t,\xi_{2t-1}))\mapsto z_{t+1}=\mathrm{Prox}_{z_t}(\gamma_t g(w_t,\xi_{2t}))\mapsto\ldots\\
\hbox{approximate solutions:}&z^t=(x^t,y^t)=\left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\sum_{\tau=1}^t\gamma_\tau w_\tau,
\end{array} \tag{9}$$

where $\gamma_t>0$ are deterministic stepsizes.
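The recurrence (9) can be sketched in a few lines for a toy instance. The instance is an assumption made for illustration, not an example from the paper: a bilinear matrix game $\phi(x,y)=x^TAy$ on two simplexes, the entropy setup for both blocks, a constant stepsize, and an exact (noise-free) oracle passed in as a parameter so that a randomized oracle could be substituted. All function names are hypothetical.

```python
import math

def entropy_prox(z, xi):
    """Entropy prox-mapping on the simplex: w_i proportional to z_i * exp(-xi_i)."""
    shift = min(xi)
    w = [zi * math.exp(-(x - shift)) for zi, x in zip(z, xi)]
    s = sum(w)
    return [wi / s for wi in w]

def exact_oracle(A, x, y):
    """F(x, y) = [d(phi)/dx; -d(phi)/dy] for phi(x, y) = x^T A y."""
    m, n = len(A), len(A[0])
    Fx = [sum(A[i][j] * y[j] for j in range(n)) for i in range(m)]
    Fy = [-sum(A[i][j] * x[i] for i in range(m)) for j in range(n)]
    return Fx, Fy

def mirror_prox(A, steps, gamma, oracle=exact_oracle):
    """Recurrence (9) with constant stepsize gamma; returns the averaged w_t."""
    m, n = len(A), len(A[0])
    x, y = [1.0 / m] * m, [1.0 / n] * n          # z_1 = omega-center (uniform)
    x_avg, y_avg = [0.0] * m, [0.0] * n
    for _ in range(steps):
        gx, gy = oracle(A, x, y)
        wx = entropy_prox(x, [gamma * g for g in gx])   # w_t = Prox_{z_t}(gamma g(z_t))
        wy = entropy_prox(y, [gamma * g for g in gy])
        gx, gy = oracle(A, wx, wy)
        x = entropy_prox(x, [gamma * g for g in gx])    # z_{t+1} = Prox_{z_t}(gamma g(w_t))
        y = entropy_prox(y, [gamma * g for g in gy])
        x_avg = [a + b / steps for a, b in zip(x_avg, wx)]
        y_avg = [a + b / steps for a, b in zip(y_avg, wy)]
    return x_avg, y_avg

def duality_gap(A, x, y):
    """max_j (A^T x)_j - min_i (A y)_i for the matrix game."""
    Fx, Fy = exact_oracle(A, x, y)               # Fx = A y, Fy = -A^T x
    return max(-g for g in Fy) - min(Fx)

A = [[1.0, -1.0], [-2.0, 3.0]]
x_bar, y_bar = mirror_prox(A, steps=2000, gamma=0.2)
gap = duality_gap(A, x_bar, y_bar)
```

The stepsize 0.2 is chosen below $1/(\sqrt{2}\max_{ij}|A_{ij}|)$, a conservative choice for this instance; it is not the stepsize policy of the theorem that follows.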

The main results on SMP we need are as follows (see the case $M=\mu=0$ of [4, Corollary 1]):






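Theorem 2.1 Assume that $L<\infty$ and $\sigma<\infty$ are such that

$$\begin{array}{ll}
(a)&\|F(z)-F(z')\|_*\le L\|z-z'\|\quad\forall z,z'\in Z\\
(b)&\mathbf{E}_\xi\{\Delta(z,\xi)\}=0\quad\forall z\in Z\\
(c)&\mathbf{E}_\xi\{\|\Delta(z,\xi)\|_*^2\}\le\sigma^2\quad\forall z\in Z
\end{array} \tag{10}$$

Then for every $t=1,2,...$ the $t$-step SMP with constant stepsizes

$$\gamma_\tau=\min\left[\frac{O(1)}{L},\ O(1)\frac{\Omega}{\sigma\sqrt{t}}\right],\ 1\le\tau\le t, \tag{11}$$

with properly chosen absolute constants $O(1)$, ensures that

$$\mathbf{E}\{\mathrm{DualityGap}(x^t,y^t)\}\le K_t:=\max\left[O(1)\frac{\Omega^2L}{t},\ O(1)\frac{\Omega\sigma}{\sqrt{t}}\right]. \tag{12}$$

In addition, strengthening (10.c) to

$$\mathbf{E}_\xi\{\Delta(z,\xi)\}=0,\quad\mathbf{E}\{\exp\{\|\Delta(z,\xi)\|_*^2/\sigma^2\}\}\le\exp\{1\}, \tag{13}$$

we have an exponential bound on large deviations: for every $\Lambda>0$, we have

$$\mathrm{Prob}\left\{\mathrm{DualityGap}(x^t,y^t)>K_t+O(1)\Lambda\frac{\Omega\sigma}{\sqrt{t}}\right\}\le\exp\{-\Lambda^2/3\}+\exp\{-\Lambda t\}. \tag{14}$$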

3 Randomization 

Problem (3) by itself is a fully deterministic problem; with a "normal" representation of the
polynomial $\phi(x,y)$ (e.g., by the list of its nonzero coefficients), a precise ($\sigma=0$)
deterministic oracle for $F$ is available; utilizing this oracle, a solution of accuracy
$\epsilon$ is obtained in $O(1)\Omega^2L/\epsilon$ iterations, with the computational effort
per iteration dominated by the necessity to compute the values of $F$ at two points and the
values of two prox-mappings. When $Z$ is "simple enough," the complexity of the second of
these two tasks, computing prox-mappings, is a tiny fraction of the complexity of precise
computation of the values of $F$. Whenever this is the case, it might make sense to replace
the precise values of $F$ (which can be very costly in the large-scale case) with
computationally cheap unbiased random estimates of these values. This is the option we
intend to investigate in this paper. We start with a general description of the randomization
we intend to use.

Observe, first, that

$$F(z)=D\nabla\phi(z),$$

where $D=\mathrm{Diag}\{\mathrm{Id}_x,-\mathrm{Id}_y\}$, $\mathrm{Id}_x$ and $\mathrm{Id}_y$
being the identity mappings on $E_x$ and $E_y$, respectively. Now, representing the polynomial
$\phi(z)$ as

$$\phi(z)=\sum_{k=0}^{d}Q_k(\underbrace{z,...,z}_{k}), \tag{15}$$

where $Q_k(z^1,...,z^k)$ is a symmetric $k$-linear form on $E$, we have

$$\langle F(z),h\rangle=\langle D\nabla\phi(z),h\rangle=\langle\nabla\phi(z),Dh\rangle=\sum_{k=1}^{d}kQ_k(Dh,\underbrace{z,...,z}_{k-1}). \tag{16}$$
Now assume that we can associate with every $z\in Z$ a probability distribution $P_z$ on $E$
such that

$$\int\zeta\,dP_z(\zeta)=z\quad\forall z\in Z. \tag{17}$$

In order to get an unbiased estimate of $F(z)$, one can act as follows:

• given $z$, draw $d-1$ independent samples $z^i\sim P_z$, $i=1,...,d-1$;

• compute the linear form $G=G[z^1,...,z^{d-1}]$ on $E$ given by

$$\forall h\in E:\ \langle G,h\rangle=\sum_{k=1}^{d}kQ_k(Dh,z^1,z^2,...,z^{k-1}), \tag{18}$$

thus ensuring that

$$\mathbf{E}_{(z^1,...,z^{d-1})\sim P_z\times...\times P_z}\{G[z^1,...,z^{d-1}]\}=F(z)\quad\forall z\in Z. \tag{19}$$

Note that we can represent a random variable distributed according to $P_z$ as a deterministic
function of $z$ and a standard random variable $\xi$ uniformly distributed on $[0,1]$, which
makes $G$ a deterministic function of $z$ and $\xi\sim\mathrm{Uniform}[0,1]$, as required by
our model of a Stochastic Oracle.
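The unbiasedness claim (19) rests only on multilinearity of the forms $Q_k$ and independence of the samples, and it can be checked exactly on a small instance. The sketch below is illustrative and makes several assumptions that are not part of the text: the polynomial is a single cubic form $\phi(z)=Q_3(z,z,z)$ on $\mathbf{R}^3$ given by a symmetric tensor `T`, the sign matrix $D$ is taken to be the identity (pure minimization), and $P_z$ is a discrete distribution supported on scaled coordinate vectors. The expectation is computed by enumerating all pairs of support points, so no sampling noise is involved.

```python
import itertools
import math

def symmetrize(T):
    """Average a 3-way tensor over all permutations of its indices."""
    n = len(T)
    S = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for idx in itertools.product(range(n), repeat=3):
        perms = list(itertools.permutations(idx))
        S[idx[0]][idx[1]][idx[2]] = sum(T[a][b][c] for a, b, c in perms) / len(perms)
    return S

def linear_form(T, u, v):
    """Coefficients of h -> 3 * Q3(h, u, v); with u = v = z this is the exact gradient."""
    n = len(T)
    return [3.0 * sum(T[a][j][k] * u[j] * v[k] for j in range(n) for k in range(n))
            for a in range(n)]

def sparse_distribution(z):
    """Support points and probabilities of a discrete distribution with mean z (z != 0)."""
    norm1 = sum(abs(c) for c in z)
    support, probs = [], []
    for i, c in enumerate(z):
        if c != 0.0:
            f = [0.0] * len(z)
            f[i] = math.copysign(norm1, c)
            support.append(f)
            probs.append(abs(c) / norm1)
    return support, probs

def expected_estimate(T, z):
    """Exact expectation of G[z^1, z^2] over independent z^1, z^2 ~ P_z."""
    support, probs = sparse_distribution(z)
    mean = [0.0] * len(z)
    for (f1, p1), (f2, p2) in itertools.product(list(zip(support, probs)), repeat=2):
        g = linear_form(T, f1, f2)
        mean = [m + p1 * p2 * gi for m, gi in zip(mean, g)]
    return mean

base = [[[(i + 1) + 2.0 * (j + 1) * (k + 1) - (i + 1) * (k + 1) ** 2
          for k in range(3)] for j in range(3)] for i in range(3)]
T = symmetrize(base)
z = [0.5, -1.0, 2.0]
exact = linear_form(T, z, z)
averaged = expected_estimate(T, z)   # agrees with `exact` up to round-off
```

For a convex-concave problem one would additionally flip the sign of the $y$-block, as prescribed by $D$.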

Observe that for a general-type convex-concave polynomial $\phi(x,y)$ of degree $d$, precise
deterministic computation of $F(z)$ is as suggested by (18) with $P_z$ being the unit mass
sitting at the singleton $z$, that is, with $z^1=...=z^{d-1}=z$. It follows that if the
distributions $P_z$, for every $z\in Z$, are such that computing the vectors of coefficients of
the linear forms $Q_k(h,z^1,...,z^{k-1})$ of $h\in E$ is much cheaper than the similar task for
the linear forms $Q_k(h,z,...,z)$ for a "general position" $z\in Z$, then computing the
unbiased estimate $G=G[z^1,...,z^{d-1}]$ of $F(z)$ is much cheaper computationally than the
precise computation of $F(z)$, so that there are chances for the outlined randomization to
reduce the overall complexity of computing an $\epsilon$-solution to (3). Let us look at two
simple preliminary examples:

Example 1 ["Scalar case"]: $E$ is just the space $\mathbf{R}^n$ of $n$-dimensional vectors,
and the $k$-linear forms $Q_k(\cdot)$ are given by lists of their nonzero coefficients. In
this case, we can specify $P_z$ as follows:

Given $z\in E\setminus\{0\}=\mathbf{R}^n\setminus\{0\}$, let $P_z$ be the discrete probability
distribution supported on the set $\{f^i=\mathrm{sign}(z_i)\|z\|_1e_i\}_{i=1}^n$, where $e_i$
are the standard basic orths in $E$, with the probability mass of $f^i$ equal to
$|z_i|/\|z\|_1$; when $z=0$, let $P_z$ be the unit mass sitting at the origin. We clearly
have $\mathbf{E}_{f\sim P_z}\{f\}=z$, and all realizations of $f\sim P_z$ are extremely sparse,
with at most one nonzero entry. Now, in order to generate $f\sim P_z$, we need preprocessing of
$O(1)n$ a.o. aimed to compute $\|z\|_1$ and the "cumulative distribution" $s_0=0$,
$s_i=\sum_{j=1}^{i}|z_j|/\|z\|_1$, $i=1,...,n$.

With this cumulative distribution at hand, to draw a sample $f\sim P_z$ takes just
$O(1)\ln(n)$ a.o.: we draw at random a real $a$ uniformly distributed in $[0,1]$ (which for
all practical purposes is just $O(1)$ a.o.), find by bisection the smallest $i\in\{1,...,n\}$
such that $a\le s_i$ ($O(1)\ln(n)$ a.o.) and return the vector
$f=\mathrm{sign}(z_i)\|z\|_1e_i$ ($O(1)$ a.o.). Thus, generating $z^1,...,z^{d-1}$ costs
$O(1)[n+d\ln(n)]$ a.o. Now, with our "ultimately sparse" $z^1,...,z^{d-1}$, computing the
coefficients of the linear form $Q_k(h,z^1,...,z^{k-1})$ of $h$ takes at most $n(d+C)$ a.o.,
where $C$ is (an upper bound on) the cost of extracting a coefficient of the $k$-linear
symmetric form, given its "address." The bottom line is that the complexity of computing
$G[z^1,...,z^{d-1}]$ is

$$\mathcal{C}_r[P]=O(1)d[n+d\ln(n)+n(d+C)]\le O(1)[d^2n+dnC]\ \hbox{a.o.}$$

On the other hand, computing $F(z)$ exactly costs something like

$$\mathcal{C}_d[P]=O(1)\left[n+\sum_{k=1}^{d}kN_kC\right]\ \hbox{a.o.},$$

where $N_k$ is the total number of nonzero coefficients in $Q_k(\cdot,...,\cdot)$. Assuming
that $d=O(1)$, we see that unless all $Q_k$ are pretty sparse, with just $N_k=O(n)$ nonzero
coefficients, mimicking the unbiased Stochastic Oracle takes orders of magnitude fewer
computations than precise deterministic computation of $F(z)$.
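The sampling procedure of Example 1 (linear-time preprocessing, then one bisection per sample) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, a sample is returned in sparse form as an (index, value) pair, and, as a deliberate deviation, the cumulative sums are kept unnormalized and the uniform variable is rescaled to $(0,\|z\|_1]$, which is equivalent to the normalized version and never selects a zero entry of $z$.

```python
import bisect
import math
import random

def preprocess(z):
    """O(n) preprocessing: running sums of |z_j|; the last one equals ||z||_1."""
    cum, s = [], 0.0
    for c in z:
        s += abs(c)
        cum.append(s)
    return cum

def index_for(a, cum):
    """Smallest i with a <= cum[i], found by bisection in O(ln n)."""
    return bisect.bisect_left(cum, a)

def sample(z, cum, rng):
    """Draw f ~ P_z as (i, value), meaning f = value * e_i with value = sign(z_i) ||z||_1."""
    norm1 = cum[-1]
    a = (1.0 - rng.random()) * norm1     # uniform on (0, ||z||_1]
    i = index_for(a, cum)
    return i, math.copysign(norm1, z[i])

z = [0.5, -1.0, 0.0, 2.5]
cum = preprocess(z)
rng = random.Random(0)                    # fixed seed, for reproducibility only
draws = [sample(z, cum, rng) for _ in range(20000)]
empirical_mean = [0.0] * len(z)
for i, value in draws:
    empirical_mean[i] += value / len(draws)   # should be close to z
```

Each draw touches $O(\ln n)$ entries of `cum`, so after the one-time preprocessing the cost of generating the $d-1$ samples is negligible compared with evaluating the forms $Q_k$.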

4 Complexity Analysis 

The discussion in the previous section demonstrates that in some interesting cases unbiased
random estimates of the vector field $F$ associated with (3) are significantly cheaper
computationally than the precise values of $F$. This does not mean, however, that in all these
cases randomization is profitable: it may well happen that as far as the overall complexity of
an $\epsilon$-solution is concerned, expensive high-quality local information is better than
cheap low-quality information. We intend to analyze the situation in the regime when the
degree $d$ of the polynomial $\phi$ is a small integer formally treated as $O(1)$, so that we
ignore the details of the dependence on $d$ of the hidden factors in the estimates to follow.

4.1 Preliminaries 

Standing Assumptions. Observe that

$$L[Z]:=\mathrm{Lin}(Z-Z)=\mathrm{Lin}(X-X)\times\mathrm{Lin}(Y-Y)=L[X]\times L[Y]. \tag{20}$$

Now, the sets

$$X'=\tfrac{1}{2}[X-X],\quad Y'=\tfrac{1}{2}[Y-Y],\quad Z'=\tfrac{1}{2}[Z-Z]=X'\times Y'$$

are unit balls of certain norms $\|\cdot\|_X$ on $L[X]$, $\|\cdot\|_Y$ on $L[Y]$ and
$\|\cdot\|$ on $L[Z]$, with

$$\|(x,y)\|=\max[\|x\|_X,\|y\|_Y],\quad x\in L[X],\ y\in L[Y]. \tag{21}$$

From now on, we make the following

Assumption A. The just defined norm $\|\cdot\|$ with the unit ball $\frac{1}{2}[Z-Z]$ is the
norm used in the SMP setup, while the d.-g.f. $\omega(x,y)$ is of the form
$\omega_X(x)+\omega_Y(y)$, where $(\|\cdot\|_X,\omega_X(\cdot))$ and
$(\|\cdot\|_Y,\omega_Y(\cdot))$ form SMP setups for $(X,E_x)$ and $(Y,E_y)$, respectively¹.

¹Note that such a sum indeed is a d.-g.f. fitting the norm $\|\cdot\|$.
Note that

• We have

$$\|[\xi;\eta]\|_*=\|\xi\|_{X,*}+\|\eta\|_{Y,*}, \tag{22}$$

where $\|\cdot\|_{X,*}$ and $\|\cdot\|_{Y,*}$ are the (semi)norms conjugate to
$\|\cdot\|_X$, $\|\cdot\|_Y$, respectively. In particular, we have

$$\|F(z)\|_*=\|\nabla\phi(z)\|_*,\quad\|F(z)-F(z')\|_*=\|\nabla\phi(z)-\nabla\phi(z')\|_*\quad\forall z,z'\in E. \tag{23}$$

• The $\omega$-radius $\Omega$ of $Z$ is

$$\Omega=\sqrt{\Omega_X^2+\Omega_Y^2},\quad\Omega_X=\sqrt{2\left[\max_{x\in X}\omega_X(x)-\min_{x\in X}\omega_X(x)\right]},\quad\Omega_Y=\sqrt{2\left[\max_{y\in Y}\omega_Y(y)-\min_{y\in Y}\omega_Y(y)\right]}. \tag{24}$$

Scale factor. When speaking about the complexity of finding an $\epsilon$-solution, we will
express this complexity in terms of the relative accuracy $\nu=\epsilon/V$, where the scale
factor $V$ is defined as follows. Let $\widehat{Z}$ be the convex hull of $\{0\}\cup Z$, and
let

$$\widehat{\phi}(z)=\phi(z)-\phi(0)-\langle\phi'(0),z\rangle=\sum_{k=2}^{d}Q_k(\underbrace{z,...,z}_{k}).$$

We set

$$V=V_Z[\phi]:=\max_{z\in\widehat{Z}}\widehat{\phi}(z)-\min_{z\in\widehat{Z}}\widehat{\phi}(z). \tag{25}$$

The importance of this scale factor in our context stems from the following simple observation
(see also Lemma 4.2 below):

Lemma 4.1 For a properly chosen positive real $C^{(1)}$ depending solely on $d$, for all $k$,
$2\le k\le d$, and all collections $z^1,...,z^k$ of vectors from $L[Z]$ one has

$$|Q_k(z^1,...,z^k)|\le C^{(1)}V\prod_{i=1}^{k}\|z^i\|. \tag{26}$$

In particular, the vector field $F(z)$ associated with (3) satisfies (10.a) with

$$L=C^{(1)}V\sum_{k=2}^{d}k(k-1)2^{k-2}:=C^{(2)}V, \tag{27}$$

where $C^{(2)}$ depends solely on $d$.

For proof, see Appendix.

An immediate question related to the definition of the scale factor is the following: as
defined, a "shift of the problem by $a\in E$," that is, a simple substitution of variables
$z=w-a$, changes the factor and thus the complexity estimates, although such a substitution
leaves the problem "the same." The answer is as follows: while the "shift option" should be
kept in mind, such a shift changes the Stochastic Oracle as given by (18). Indeed, this oracle
is defined in terms of the homogeneous components in the Taylor decomposition of $\phi$ taken
at the origin, and this is why the origin participates in the description of the convex hull
of $\{0\}\cup Z$ and thus of $V$. Shifting the origin, we, in general, change the SO², and thus
there is nothing strange in the fact that our scaling of the accuracy (and thus the efficiency
estimates) corresponding to a given $Z$ and a given (implicitly participating in (18)) SO is
not translation-invariant.



4.2 Complexity Analysis 



Preliminaries. From now on we assume that as applied to (3), SMP utilizes the Stochastic
Oracle SO given according to (18) by a family of probability distributions
$\mathcal{P}=\{P_z:z\in Z\}$ on $E$ satisfying (17). From now on, we make the following

Assumption B. For some $\rho\ge0$, all distributions $P_z$, $z\in Z$, are supported on the
set $Z+2\rho Z'\subset\mathrm{Aff}(Z)$, where $Z'=\frac{1}{2}[Z-Z]$ and $\mathrm{Aff}(Z)$ is
the affine hull of $Z$.

In particular, when $P_z$ is supported on $Z$ for all $z\in Z$ ("proper case"), Assumption B
is satisfied with $\rho=0$.

It is time now to note that the SO we have developed so far gives rise to a parametric family
of Stochastic Oracles, specifically, as follows. First of all, our basic SO in fact can be
"split" into two Stochastic Oracles, $SO^x$ and $SO^y$, providing estimates of the $x$- and
the $y$-components $F_x$, $F_y$ of $F(z)=[F_x(z);F_y(z)]$: the estimates

$$\begin{array}{l}
E_x\ni G_x=G_x[z^1,...,z^{d-1}]:\ \forall\xi\in E_x:\ \langle G_x,\xi\rangle=\sum_{k=1}^{d}kQ_k([\xi;0],z^1,...,z^{k-1}),\\
E_y\ni G_y=G_y[z^1,...,z^{d-1}]:\ \forall\eta\in E_y:\ \langle G_y,\eta\rangle=-\sum_{k=1}^{d}kQ_k([0;\eta],z^1,...,z^{k-1}).
\end{array}$$

Here, as above, $z^1,...,z^{d-1}$ are, independently of each other, sampled from $P_z$. Now,
given two positive integers $k_x$, $k_y$, we can "recombine" our "partial stochastic oracles"
$SO^x$, $SO^y$ into a new Stochastic Oracle $SO_{k_x,k_y}$ as follows: in order to generate a
random estimate of $F(z)$ given $z\in Z$, we generate $(d-1)\max[k_x,k_y]$ independent samples
$z^{s\ell}\sim P_z$, $1\le s\le d-1$, $1\le\ell\le\max[k_x,k_y]$, and then set

$$G\left(z,\{z^{s\ell}\}\right)=\left[\frac{1}{k_x}\sum_{\ell=1}^{k_x}G_x[z^{1\ell},...,z^{d-1,\ell}];\ \frac{1}{k_y}\sum_{\ell=1}^{k_y}G_y[z^{1\ell},...,z^{d-1,\ell}]\right]. \tag{28}$$

In the sequel, we refer to $k_x$ and $k_y$ as the $x$- and $y$-multiplicities of the
Stochastic Oracle $SO_{k_x,k_y}$.

We will make use of the following 

Lemma 4.2 Under Assumptions A, B, for all positive integer multiplicities $k_x$, $k_y$,
$SO_{k_x,k_y}$ ensures the validity of (10.b), as well as the validity of (13) with



a = C(3)y^^ ^ ^^^^^ fix/ V^] + min[l, fiy/^] 

where $C^{(3)}$ depends solely on $d$.

For proof, see Appendix. 

We have arrived at the following 



(29) 



²For example, with $\phi(x,y)=x^3$, the oracle (18) is $G=[3x^1x^2;0]$,
$z^i=[x^i;0]\sim P_z$. Substituting $x=1+h$, carrying out the construction of the SO "in
the $h$-variable" and translating the result back to the $x$-variable, the resulting SO turns
out to be $G=[3x^1x^2+3x^1-3x^2;0]$, which is not the oracle we started with.
Theorem 4.1 Let $t\ge1$ be given, let Assumptions A, B be satisfied, and let problem (3) be
solved by $t$-step SMP utilizing $SO_{k_x,k_y}$, with the parameters $L$, $\sigma$ underlying
the stepsize policy (11) given by (27), (29). Then, for some $C$ depending solely on $d$,



(a) E{DualityGap(x*, y*)} < K{t) := C 



+ 



^rt 



V 



1? = min[l, VLx/\/kx\ + niin[l, Vty / yky] 
ih) Prob |DualityGap(a;*, y') > K{t) + CA ^"'^+"'^^'^ -^^^^| < exp{-AV3} + exp{-At}. 



VA > 0. 
(30) 



5 Illustrations 

We illustrate the proposed approach by two examples. The first of them is of a purely academic
nature; the second may be of some practical interest. When selecting the examples,
our major goal was to illustrate randomization schemes different from the one in Example 1.

5.1 Illustration I: minimizing the maximal eigenvalue of a quadratic 
matrix pencil 

The problem we are interested in is as follows: We are given a symmetric matrix $A(x)$
quadratically depending on the "design variables" $x_j$, which themselves are matrices:

$$A(x)=\sum_{i=1}^{I}\left[a_i^Tx_{j(i)}^Tq_ix_{j(i)}a_i+b_i^Tx_{j(i)}c_i+c_i^Tx_{j(i)}^Tb_i\right]+d\in E_y:=\mathbf{S}^m, \tag{31}$$

where

• $\mathbf{S}^m$ is the space of $m\times m$ symmetric matrices equipped with the Frobenius
inner product,

• $x=\{x_j\in\mathbf{R}^{m_j\times n_j}\}_{j=1}^{J}$ is a collection of variable matrices
which we treat as a block-diagonal rectangular matrix with diagonal blocks $x_j$,
$1\le j\le J$. We denote the linear space of all these matrices by $E_x$ and equip it with
the Frobenius inner product;

• $j(i)\in\{1,...,J\}$, $1\le i\le I$, are given integers,

• $\{a_i,b_i,c_i,q_i\}_{i=1}^{I}$, $d$ are data matrices of appropriate sizes and structures:

$$a_i,c_i\in\mathbf{R}^{n_{j(i)}\times m},\quad b_i\in\mathbf{R}^{m_{j(i)}\times m},\quad q_i\in\mathbf{S}^{m_{j(i)}},\quad d\in\mathbf{S}^m;$$

in addition, we assume that all $q_i$ are positive semidefinite, and that the values $j(i)$,
$1\le i\le I$, cover the entire range $1\le j\le J$, meaning that every one of the blocks
$x_j$ indeed participates in $A(\cdot)$.

For a matrix $a\in\mathbf{R}^{p\times q}$, let
$\sigma(a)=[\sigma_1(a);...;\sigma_{\min[p,q]}(a)]$ be the vector of singular values of $a$
arranged in non-ascending order, and let $\|a\|_{\rm nuc}=\|\sigma(a)\|_1$ be the nuclear norm
of $a$. For a symmetric matrix $a$, let $\lambda_{\max}(a)$ be the maximal eigenvalue of $a$.
Finally, let

$$X=\{x\in E_x:\|x\|_{\rm nuc}\le1\}.$$
Our goal is to solve the optimization problem

$$\mathrm{Opt}=\min_{x\in X}\{\lambda_{\max}(A(x))\}. \tag{32}$$

Denoting by $Y$ the standard spectahedron in $\mathbf{S}^m$:

$$Y=\{y\in\mathbf{S}^m:y\succeq0,\ \mathrm{Tr}(y)=1\}$$

and observing that $\lambda_{\max}(a)=\max_y\{\mathrm{Tr}(ay):y\in Y\}$, we can convert the
problem of interest into the saddle point problem as follows:

$$\mathrm{Opt}=\min_{x\in X}\max_{y\in Y}\left[\phi(x,y):=\mathrm{Tr}(yA(x))\right]. \tag{33}$$

From $q_i\succeq0$, $i\le I$, and the fact that $y\succeq0$ for all $y\in Y$ it immediately
follows that the restriction of $\phi$ on $y\in Y$ is convex in $x\in E_x$; as a function of
$y$, $\phi$ is linear, so that $\phi$ is a convex-concave on $X\times Y$ polynomial of degree
$d=3$. The monotone mapping (6) associated with (33) is

$$\begin{array}{rcl}
F_x(x,y)&=&2\,\mathrm{Diag}\left\{\sum_{i:j(i)=j}\left[q_ix_ja_iya_i^T+b_iyc_i^T\right],\ 1\le j\le J\right\}\in E_x,\\
F_y(x,y)&=&-A(x).
\end{array} \tag{34}$$

Now let us apply to (33) the approach we have developed so far. 

A. First, let us fix the setup for SMP. We are in the situation when
$X':=\frac{1}{2}[X-X]$ is $X$, the unit ball of the nuclear norm on $E_x$; thus,
$\|\cdot\|_X$ is the nuclear norm on $E_x$. The set $Y'=\frac{1}{2}[Y-Y]$ clearly is contained
in the unit nuclear norm ball of $\mathbf{S}^m$ and contains the concentric nuclear norm ball
of radius 1/2, meaning that $\|\cdot\|_Y$ is within factor 2 of the nuclear norm:

$$2\|y\|_{\rm nuc}\ge\|y\|_Y\ge\|y\|_{\rm nuc}\quad\forall y\in L[Y]\subset\mathbf{S}^m=E_y.$$

The best, within $O(1)$ factors, choice of the d.-g.f.'s known so far under the circumstances
is (see [6, section 5.7.1] or Propositions A.3, A.2 in Appendix)

ujx{x) = 0(1) ln(n) X)"=i ^f'^\^)^ ^ = E/=i minfm^-, n^], q{n) = ^T^, 3) ^^^^ 
^Y{y) = C>(l)ln(m)X)^i(7^^'"^(y), p(m) = 2hfc)' 

with explicitly given absolute constants $O(1)$. This choice is reasonably good in terms of the
values of the corresponding radii of $X$, $Y$, which turn out to be "quite moderate:"

$$\Omega_X\le O(1)\sqrt{\ln(n)},\quad\Omega_Y\le O(1)\sqrt{\ln(m)}. \tag{36}$$

Note that the efficiency estimate (30) says that we are interested in as small values of
$\Omega_X$, $\Omega_Y$ as possible. At the same time, it is immediately seen that if
$\omega(\cdot)$ is a d.-g.f. for $Z$ compatible with the norm generated by $Z$ (i.e., with the
unit ball $Z'=\frac{1}{2}[Z-Z]$), then the $\omega$-radius of $Z$ is at least $O(1)$, so that
$\Omega_X$, $\Omega_Y$ are "nearly as good" as they could be.

The outlined d.-g.f.'s are also the best known under the circumstances in terms of the
computational complexity of the associated prox-mapping; it is easily seen that this
complexity is dominated by the necessity to carry out the singular value decomposition of a
matrix from $E_x$ (which takes $O(\sum_{j=1}^{J}m_jn_j\min[m_j,n_j])$ a.o.) and the
eigenvalue decomposition of a matrix from $\mathbf{S}^m$ ($O(m^3)$ a.o.), see below.



³To avoid trivial situations, we assume from now on that $m>1$, $n>1$.
B. With our approach, the "basic" option when solving (33) is to use the deterministic
version of SMP, i.e., to use as $P_z$ the unit mass sitting at $z$. The corresponding
efficiency estimate can be obtained from (30) by setting $k_x=k_y=\infty$; taking into account
(36), the resulting estimate says that a solution to (33) of a given accuracy $\epsilon\le V$
will be found in the course of

$$N_d(\epsilon/V)=O(1)\ln(mn)V/\epsilon \tag{37}$$

iterations. Now let us evaluate the arithmetic complexity of an iteration. From the
description of the algorithm it is clear that the computational effort at an iteration is
dominated by the necessity to compute exactly $O(1)$ values of the monotone mapping (34) and
of $O(1)$ prox-mappings. To simplify evaluating the computational cost of an iteration, assume
from now on that we are in the simple case

$$m_j=n_j=\nu,\quad1\le j\le J.$$

In this case, computing $O(1)$ values of the prox-mapping costs

$$\mathcal{C}_{\rm prox}=O(1)[m^3+J\nu^3]\ \hbox{a.o.}$$

        Indeed, with our $\omega_X(\cdot)$, computing the $x$-component of the prox-mapping
        reduces to solving the optimization problem
        $\min_{v\in E_x,\|v\|_{\rm nuc}\le1}\left[\sum_{i=1}^{n}\sigma_i^p(v)-\mathrm{Tr}(gv^T)\right]$
        with a given $p\in(1,2]$ and a given $g\in E_x$. To solve the problem, we compute the
        singular value decompositions of all diagonal blocks $g_j$ in $g$, thus getting a
        representation $g=U\mathrm{Diag}\{\gamma\}V^T$ with block-diagonal orthogonal matrices
        $U$, $V$, which takes $O(1)J\nu^3$ a.o. It is immediately seen that the problem admits
        an optimal solution $v$ of the same structure as $g$:
        $v=U\mathrm{Diag}\{\bar{v}\}V^T$. Specifying $\bar{v}$ reduces to solving the convex
        optimization problem

        $$\min_{\bar{v}\in\mathbf{R}^n:\|\bar{v}\|_1\le1}\left[\sum_{i=1}^{n}|\bar{v}_i|^p-\gamma^T\bar{v}\right];$$

        this convex problem with separable objective and a single separable constraint clearly
        can be solved within machine precision in $O(n)$ a.o. Finally, given $\bar{v}$, it
        takes $O(1)J\nu^3$ operations to compute the $x$-component
        $U\mathrm{Diag}\{\bar{v}\}V^T$ of the prox-mapping. Thus, the total cost of the
        $x$-component of the prox-mapping is $O(1)J\nu^3$ a.o. The situation with computing
        the $y$-component of the mapping is completely similar, and the cost of this
        component is $O(1)m^3$ a.o.

Looking at (34), we see that the cost of computing $O(1)$ values of $F$ at "general position"
points $z$, assuming all the data matrices dense, is

$$\mathcal{C}_F=O(1)\nu m(\nu+m)I\ \hbox{a.o.}$$

As a result, the arithmetic cost of finding an $\epsilon$-solution to (33) (and thus to (32))
by the deterministic version of SMP is

$$\mathcal{C}_d(\epsilon)=O(1)\ln(mn)\Big[\underbrace{m^3+J\nu^3}_{\mathcal{C}_{\rm prox}}+\underbrace{m\nu(m+\nu)I}_{\mathcal{C}_F}\Big]\frac{V}{\epsilon}\ \hbox{a.o.} \tag{38}$$

Note that we are not aware of better complexity bounds for large-scale problems (32), at least
in the case when in the expression for $\mathcal{C}_d$, the term $m^3$ is dominated by the sum
of the other terms.



C. Now let us see whether we can reduce the overall arithmetic cost of an $\epsilon$-solution
to (32) by randomization. An immediate observation is that the only case when this can happen
is the one of $\mathcal{C}_F\gg\mathcal{C}_{\rm prox}$. Indeed, comparing the efficiency
estimates (30) and (37), we conclude that randomization can only increase the iteration count
of an $\epsilon$-solution; in order to outweigh the growth in the number of iterations, we
need to reduce significantly the arithmetic cost of an iteration, and to this end, this cost,
in the deterministic case, should be by far dominated by the cost of computing the values of
$F$ (the only component of our computational effort which can be reduced by randomization).
Assuming $\mathcal{C}_F\gg\mathcal{C}_{\rm prox}$, let us see which kind of randomization
could be useful in our context. Note that in order for randomization to be useful, the
underlying distributions $P_z$ should be supported on the set of those pairs $(x,y)$ for which
computing an estimate $G$ of $F(z)$ according to (28) is much cheaper than computing $F$ at a
general-type point $(x,y)\in E_x\times E_y$. A natural way to meet this requirement is to use
the "matrix analogy" of Example 1, where $P_z$ are supported on the set of low rank matrices.
Specifically, in order to get an unbiased estimate of $F(z)$, $z=(x,y)\in X\times Y$, let us
act as follows:

1. We compute the singular value decomposition $x=U\mathrm{Diag}\{\sigma(x)\}V^T$ of $x$ and
the eigenvalue decomposition $y=W\mathrm{Diag}\{\sigma(y)\}W^T$ of $y$⁴, where $U$, $V$ are
block-diagonal $n\times n$ orthogonal matrices with $\nu\times\nu$ diagonal blocks, and $W$ is
an orthogonal $m\times m$ matrix.

2. We specify $P_x$ as the distribution of a random matrix $\xi\in E_x$ which takes the values

$$\|\sigma(x)\|_1\mathrm{Col}_j[U]\mathrm{Col}_j^T[V],\quad1\le j\le n,$$

with the probabilities $\sigma_j(x)/\|\sigma(x)\|_1$ (when $\sigma(x)=0$, $\xi$ takes the
value 0 with probability 1); here $\mathrm{Col}_j(A)$ denotes the $j$-th column of a matrix
$A$.

3. We specify $P_y$ as the distribution of the random symmetric matrix $\eta$ which takes the
values $\mathrm{Col}_i[W]\mathrm{Col}_i^T[W]$, $1\le i\le m$, with probabilities
$\sigma_i(y)$, and specify $P_z$ as the direct product of $P_x$ and $P_y$.

Observe that the expectation of $\zeta\sim P_z$ is exactly $z$, and that $P_z$,
$z\in Z=X\times Y$, is supported on $Z$ due to $\|\sigma(x)\|_1\le1$, $x\in X$,
$\|\sigma(y)\|_1=1$, $y\in Y$. In other words, Assumption B is satisfied with $\rho=0$.
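The two facts just stated (the mean of $P_z$ is $z$, and every realization stays in $Z$) can be verified directly on a tiny instance. The sketch below is illustrative only: it assumes the factorizations are already available, builds $x$ and $y$ from hand-picked $2\times2$ rotation matrices and spectra, and uses hypothetical helper names. It enumerates the support of $P_x$ and $P_y$ and averages exactly rather than sampling.

```python
import math

def rotation(theta):
    """An orthogonal 2x2 matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def column(M, j):
    return [row[j] for row in M]

def outer(u, v, scale=1.0):
    return [[scale * ui * vj for vj in v] for ui in u]

def from_factors(U, sigma, V):
    """The matrix U Diag{sigma} V^T."""
    k = len(sigma)
    return [[sum(U[r][j] * sigma[j] * V[c][j] for j in range(k))
             for c in range(len(V))] for r in range(len(U))]

def rank_one_support(U, sigma, V):
    """Values ||sigma||_1 Col_j[U] Col_j[V]^T with probabilities sigma_j / ||sigma||_1."""
    total = sum(sigma)
    values = [outer(column(U, j), column(V, j), total) for j in range(len(sigma))]
    probs = [s / total for s in sigma]
    return values, probs

def mean(values, probs):
    rows, cols = len(values[0]), len(values[0][0])
    return [[sum(p * M[r][c] for M, p in zip(values, probs)) for c in range(cols)]
            for r in range(rows)]

# x with singular values (0.5, 0.2), so ||x||_nuc = 0.7 <= 1 and x is in X
U, V, sing = rotation(0.3), rotation(1.1), [0.5, 0.2]
x = from_factors(U, sing, V)
x_values, x_probs = rank_one_support(U, sing, V)

# y in the spectahedron: eigenvalues (0.7, 0.3), trace one
W, eig = rotation(0.8), [0.7, 0.3]
y = from_factors(W, eig, W)
y_values, y_probs = rank_one_support(W, eig, W)

x_mean = mean(x_values, x_probs)   # equals x up to round-off
y_mean = mean(y_values, y_probs)   # equals y up to round-off
```

Every element of `x_values` is a rank-one matrix of nuclear norm $\|\sigma(x)\|_1$, and every element of `y_values` is a rank-one projector of trace one, which is the source of the savings when evaluating $F$ at such points.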

Note that with the just defined $P_z$, a realization $\zeta=(\xi,\eta)\sim P_z$ is of very
special structure:

$$\xi=uv^T,\ u,v\in\mathbf{R}^{n},\quad\eta=ww^T,\ w\in\mathbf{R}^m; \tag{39}$$

moreover, among the $J$ consecutive $\nu$-dimensional blocks $u_j$, $v_j$, $j=1,...,J$, of
every one of the vectors $u,v\in\mathbf{R}^{n=J\nu}$, all but one blocks are zero, and the
nonzero blocks $u_j$, $v_j$ share a common index $j$.

⁴Note that the singular values of $y$ are the same as its eigenvalues, since $y\succeq0$ due
to $y\in Y$.
It is immediately seen that with the just defined distributions $P_z$, the unbiased estimate
(28) of $F(z)$ is as follows:

- k^l^ JJiagS 2^ [<liUj [Vj \ ttiW [w J Oj + QiUj [Vj \ aiW [w \ 
e=i L i--j{i)=j 



" e=i 1=1 



+5^ 2£-l[^,2^-llT„. , c^^,2^-lr 2£-liTL. 

(40) 

where the collections 

(K;.. •4]K;-;^Sr,^'Kr) , ^ = l,...,2max[A;,,A;,] 

are drawn from $P_z$ independently of each other.

It is immediately seen that the arithmetic cost of computing $(G_x, G_y)$ given $z = (x, y)$ is
comprised of the following components:

1. "setup cost": computing the singular value decomposition of $x$ and the eigenvalue decomposition
of $y$ ($O(1)(m^3 + J\nu^3)$ a.o.), plus computing the "cumulative distributions"
$s_j(x) = \sum_{i=1}^{j}\sigma_i(x)/\|\sigma(x)\|_1$, $1 \le j \le J\nu$, and $s_i(y) = \sum_{j=1}^{i}\sigma_j(y)$, $1 \le i \le m$ ($O(1)(m + J\nu)$
a.o.).

2. After the setup cost is paid, for every $\ell$

- generating the $\ell$-th collection of samples costs $O(1)(\ln(m) + \ln(J\nu) + m + \nu)$ a.o.,

- computing the contribution of this collection to $G_x$ costs no more than $O(1)I\nu(m + \nu)$ a.o. (look
at (40) and take into account that the sampled vectors $u$, $v$ have a single nonzero $\nu$-dimensional
block each), and this cost should be paid $O(1)k_x$ times;

- computing the contribution of this collection to $G_y$ costs at most $O(1)(m + \nu)^2K$ a.o., where
$K = \max_{1 \le j \le J}\mathrm{Card}\{i : j(i) = j\}$ (the same argument as above), and this cost is
paid at most $2k_y$ times.
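The logarithmic terms in the sampling cost come from searching the precomputed cumulative distributions. A minimal sketch of this inverse-CDF step, using only the Python standard library (the function names `cumulative` and `draw_index` are hypothetical, and indices are 0-based):

```python
from bisect import bisect_right
from itertools import accumulate
import random

def cumulative(probs):
    """Setup step: running sums s_j = p_0 + ... + p_j (linear cost)."""
    return list(accumulate(probs))

def draw_index(cum, u):
    """Return the index j with s_{j-1} <= u < s_j, found by binary search.

    u is a uniform sample from [0, 1); the search costs O(log n) comparisons.
    """
    j = bisect_right(cum, u)
    return min(j, len(cum) - 1)   # guard against round-off when u is close to 1

probs = [0.5, 0.25, 0.125, 0.125]
cum = cumulative(probs)           # paid once per point z
rng = random.Random(0)
samples = [draw_index(cum, rng.random()) for _ in range(5)]
```

Once `cum` is stored, every further draw from the same distribution avoids the linear pass over the probabilities.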

Thus, the cost of computing $(G_x, G_y)$ is

$O(1)\left(m^3 + J\nu^3 + k_x\nu(m + \nu)I + k_y(m + \nu)^2K\right)$ a.o., $\quad K = \max_{1 \le j \le J}\mathrm{Card}\{i : j(i) = j\}.$ (41)

To simplify the analysis to follow, assume from now on that $I = J$ and $j(\cdot)$ is one-to-one. In
this case $K = 1$ and the cost of an iteration is

$O(1)\left(m^3 + J\nu^3 + k_x\nu(m + \nu)J + k_y(m + \nu)^2\right)$ a.o. (42)

(Footnote to the setup cost: in fact, this cost is nonexistent. By construction of the method, the points $z$
where one needs to evaluate $F$ are the values of already computed prox-mappings; according to how we compute
these values (see above), they come together with their singular value/eigenvalue decompositions.)

Now let us evaluate the overall complexity of finding, with confidence $1 - \delta$, $\delta \ll 1$, an $\epsilon$-solution
by the randomized SMP. We assume from now on that $\epsilon < V$ (otherwise the problem is trivial,
since $\mathrm{DualityGap}(z) \le V$ for every $z \in X \times Y$). For the sake of simplicity, we restrict ourselves
to the case of $k_x = k_y = 1$. Invoking the efficiency estimate (30) and taking into account
(36) and the fact that we are in the situation of $\rho = 0$, the number $t$ of iterations which results
in $\mathrm{DualityGap}(x^t, y^t) \le \epsilon$ is bounded from above by

$N_{r,\delta}(\epsilon) = O(1)\ln(mn)\ln(1/\delta)(V/\epsilon)^2,$

meaning that the iteration count now is nearly the square of the one for the deterministic algorithm,
see (37). Taking into account (42), the overall complexity of achieving our goal with the
randomized algorithm does not exceed

$\mathcal{C}_{r,\delta}(\epsilon) = O(1)\ln(mn)\ln(1/\delta)\left[m^3 + J\nu^3 + (m + \nu)(m + \nu J)\right](V/\epsilon)^2$ a.o.

The ratio of this quantity and the "deterministic complexity" (see (38) and take into account
that we are in the case of $I = J$) is

$\mathcal{R} = O(1)\ln(1/\delta)\,\frac{m^3 + \nu^3J + (m + \nu)(m + \nu J)}{m^3 + \nu^3J + m\nu(m + \nu)J}\cdot\frac{V}{\epsilon}.$

It is immediately seen that when $V/\epsilon$ and $\delta$ are fixed, and $m, \nu, J$ vary in such a way that
$m$, $n = \nu J$ go to $\infty$ and $\nu/m$, $m/n$ go to 0, $\mathcal{R}$ goes to 0, meaning that eventually the randomized
algorithm outperforms its deterministic competitor, and the "performance ratio" goes to $\infty$ as
the sizes $m, n$ of the problem grow.
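To get a feeling for this limit, one can evaluate the ratio for growing sizes. The snippet below is a rough sketch in which all $O(1)$ factors are set to 1 (an assumption, so only the trend is meaningful, not the absolute values); the function name `cost_ratio` and the chosen sizes are hypothetical.

```python
import math

def cost_ratio(m, nu, J, V_over_eps, delta):
    """Randomized-to-deterministic complexity ratio with all O(1) factors set to 1."""
    rand_part = m**3 + nu**3 * J + (m + nu) * (m + nu * J)
    det_part = m**3 + nu**3 * J + m * nu * (m + nu) * J
    return math.log(1.0 / delta) * rand_part / det_part * V_over_eps

# nu is fixed while m and n = nu * J grow with m / n -> 0
sizes = [(100, 10**3), (300, 10**5), (1000, 10**7)]
ratios = [cost_ratio(m, nu=2, J=J, V_over_eps=100.0, delta=0.01) for m, J in sizes]
for (m, J), r in zip(sizes, ratios):
    print(f"m={m}, J={J}: ratio ~ {r:.3g}")
```

With $V/\epsilon$ and $\delta$ fixed, the printed values decrease along this sequence of sizes, in line with the asymptotic statement above.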

Numerical illustration. In the experiment we are about to describe, the sizes of problem 
(32) were selected as 

m = 300, /i = i/ = 2, 7 = J = 5000, = i 

which results in $\dim x = 20000$, $\dim y = 45150$. The data matrices $q_i \succeq 0$, $a_i$, $b_i$, $c_i$ were generated
at random and normalized to have spectral norms 1, which ensures $V \le 1$. A generated
instance was processed as follows:

• first, it was solved by the deterministic Mirror Prox algorithm (DMP) with the on-line adjustable
"aggressive" stepsize policy [8]; up to this policy, this is nothing but SMP with $P_z$
specified as the unit mass sitting at $z$, $z \in Z$;

• next, it was solved by SMP (10 runs) with $k_x = 1$, $k_y = 100$ and the stepsize policy



— a mm 



1 + 



\/3>C' \/7ctVt 



,T=1,2, 



with $\mathcal{L}$ and $\sigma$ given by (27) (where we replace $V$ by its valid upper bound 1) and (29) (where we
use $\Omega_X$, $\Omega_Y$ as given by (36)). When $\alpha = 1$, our stepsize policy becomes the "rolling horizon"
version of (11); it can be shown that this policy (which does not require the number $t$ of steps
to be chosen in advance) is, theoretically, basically as good as its constant-stepsize prototype.
The role of the "acceleration factor" $\alpha \ge 1$ is to allow for larger stepsizes than those given by
the worst-case-oriented considerations underlying (11), the option which for DMP is given by
the aforementioned on-line adjustable stepsize policy (in our experiments, the latter resulted


With our $m$, $J$, the coefficient at $k_x$ in the right hand side of (41) is nearly 30 times larger than the one
at $k_y$; this is why we use $k_y \gg k_x$.

Algorithm |  Iteration count   |      CPU, sec
          |  min   mean   max  |  min   mean   max
DMP       |         61         |        2167
SMP       |  251   281    351  |  496   571    708

Table 1: Effect of randomization, problem (33) ($I = J = 5000$, $m = 300$, $\mu = \nu = 2$). In the
table: DMP/SMP - Deterministic/Randomized Mirror Prox. Data for SMP are obtained in 10
runs of the algorithm. Running times include those needed to check the termination criterion.



in stepsizes which, on average, were $\approx 250$ times the "theoretically safe" ones). The value of
$\alpha$ we used (1000) was selected empirically in a small series of pilot experiments and was never
revised in the main series of experiments.

• In every experiment, a solution with the duality gap $\le \epsilon = 0.01$ was sought. Since the
duality gap is not directly observable, this goal was achieved as follows. From time to time
(specifically, after every 30 iterations for DMP and every 50 iterations for SMP) we computed
$F(z^t)$ for the current approximate solution $z^t = (x^t, y^t)$ (see (9)), thus getting $g := \nabla_x\phi(x^t, y^t)$
and $A(x^t) = \nabla_y\phi(x^t, y^t)$. We then computed the maximal eigenvalue $\phi^+ = \lambda_{\max}(A(x^t))$, which is
nothing but $\overline{\phi}(x^t) = \max_{y \in Y}\phi(x^t, y)$, and the quantity $\phi^- = \min_{x \in X}[\phi(x^t, y^t) + \mathrm{Tr}([x - x^t]^Tg)]$,
which is a lower bound on $\underline{\phi}(y^t) = \min_{x \in X}\phi(x, y^t)$. The quantity $\Delta = \phi^+ - \phi^-$ is an upper
bound on $\mathrm{DualityGap}(x^t, y^t)$, and the relation $\Delta \le \epsilon = 0.01$ was used as the termination
criterion.
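The termination test above can be sketched in a few lines. The code below is an illustration only, under assumptions that go beyond the text: $X$ is taken to be the unit nuclear-norm ball of dense matrices (so the linear minimization has the closed form $-\|g\|_{2}$, the spectral norm), $\phi(x, y) = \mathrm{Tr}(yA(x))$, and the toy bilinear instance at the bottom is hypothetical. The function name `gap_upper_bound` is not from the paper.

```python
import numpy as np

def gap_upper_bound(phi_val, x, grad_x, A):
    """Upper bound Delta = phi_plus - phi_minus on the duality gap at (x, y).

    phi_val: phi(x, y); grad_x: gradient of phi in x at (x, y);
    A: symmetric matrix A(x) with phi(x, y') = Tr(y' A(x)).
    Assumes X = {nuclear norm <= 1}, Y = {PSD, trace 1}, phi convex in x.
    """
    phi_plus = np.linalg.eigvalsh(A)[-1]          # max over y in Y of Tr(y A(x))
    # min over X of <x' - x, g> equals -||g||_spectral - <x, g>
    phi_minus = phi_val - np.sum(x * grad_x) - np.linalg.norm(grad_x, 2)
    return phi_plus - phi_minus

# Toy instance (assumption): phi(x, y) = Tr(y (A0 + (x + x^T) / 2))
rng = np.random.default_rng(1)
m = 4
A0 = rng.standard_normal((m, m))
A0 = (A0 + A0.T) / 2
x = rng.standard_normal((m, m))
x /= np.linalg.norm(x, "nuc")
b = rng.standard_normal((m, m))
y = b @ b.T
y /= np.trace(y)

A = A0 + (x + x.T) / 2
phi_val = np.trace(y @ A)
delta = gap_upper_bound(phi_val, x, y, A)         # here grad_x phi = y
stop = delta <= 0.01                              # termination test
```

For this bilinear toy instance the linearization in $x$ is exact, so `delta` coincides with the duality gap itself rather than merely bounding it.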

The results of a typical experiment are presented in Table 1. We see that while randomization
substantially increases the iteration count, it results in an overall reduction of the CPU time by a
quite significant factor. Note that of the 2167 sec of CPU time for DMP, 91% (1982
sec) were spent on matrix-vector multiplications, and just 9% on computing prox-mappings;
for SMP, both these components take nearly equal times.



5.2 Illustration II: low dimensional approximation 

Consider the problem as follows: we are given $n$ unit vectors $a_j \in \mathbf{R}^m$, $1 \le j \le n$, and know
that for some given $k$, $1 \le k \le m/2$, and $\delta \in (0, 1)$ all $a_j$'s are at the $\|\cdot\|_2$-distance at most
$\delta < 1$ from a certain $k$-dimensional subspace $L$, common for all points. The problem is to recover
this subspace (note the difference with PCA, Principal Component Analysis: we want to minimize the
maximal, over $j$, deviation of $a_j$ from $L$ rather than the sum of squares of these deviations), which
reduces to solving the problem

$\mathrm{Opt}_* = \max_{x \in \mathcal{P}_k}\min_{y \in Y}\sum_{j=1}^{n} y_j a_j^T x a_j,$ (43)

where $\mathcal{P}_k \subset \mathbf{S}^m$ is the family of all orthoprojectors of rank $k$ on $\mathbf{R}^m$, and $Y = \{y \in \mathbf{R}^n_+ :
\sum_j y_j = 1\}$ is the standard simplex in $E_y = \mathbf{R}^n$. The set $\mathcal{P}_k$ is nonconvex; we relax it to the
set

$X = \{x \in \mathbf{S}^m : I_m \succeq x \succeq 0,\ \mathrm{Tr}(x) = k\},$

thus arriving at the relaxed saddle point problem

$-\mathrm{Opt} = \min_{x \in X}\max_{y \in Y}\left[\phi(x, y) := -\sum_{j=1}^{n} y_j a_j^T x a_j\right],$
$F_x(x, y) = -\sum_{j=1}^{n} y_j a_j a_j^T, \quad F_y(x, y) = [a_1^T x a_1; \dots; a_n^T x a_n]$ (44)

(we have equivalently transformed the relaxed problem to fit our standard notation). Note that
$\phi$ is a polynomial of degree $d = 2$ (just bilinear). Let us apply our approach to (44).



Scale factor. We clearly have $V \le 1$ (recall that $\|a_j\|_2 = 1$, $0 \preceq x \preceq I_m$ for $x \in X$, and
$\|y\|_1 \le 1$ for $y \in Y$).



Setup. We set 



My) = ^E;iil/j^^P=l/(21n(n)), 



thus getting d.-g.f.'s for $X$, $Y$ compatible with $\|\cdot\|_X$, $\|\cdot\|_Y$, respectively (Proposition A.2 and
Remark A.1); the corresponding radii of $X$, $Y$ are

< 0(l)y/k\n{k)/ln{m/k), Qy < 0{l)y/\n{n), (46) 

see (67). 



Deterministic algorithm. When solving (44) within accuracy $\epsilon < 1$ by the deterministic
algorithm DMP,

— the iteration count is N^ie) = o(i) Wfc)+in(n) ^ 

- the complexity of an iteration is $O(1)(m^3 + n)$ a.o. for computing prox-mappings and
$O(1)m^2n$ a.o. for computing the values of $F$.

Note that as far as deterministic solution algorithms are concerned, the outlined bounds result 
in the best known to us overall arithmetic complexity of finding an e-solution in the large scale 
case. 

When $n \gg m$, the cost of a prox-mapping is much smaller than that of computing the
values of $F$, implying that there might be room for acceleration by randomization.



Randomization. In order to compute, given $z = (x, y) \in X \times Y$, unbiased random estimates
of $F_x(x, y)$ and $F_y(x, y)$, we act as follows.

1. We associate with $y$ the distribution $P_y$ on $Y$ as follows: $\eta \sim P_y$ takes the values $e_j$
(basic orths in $\mathbf{R}^n$) with probabilities $y_j$, $1 \le j \le n$ (cf. Example 1); the corresponding
random estimate of $F_x(x, y)$ takes the values $-a_ja_j^T$ with probabilities $y_j$, $1 \le j \le n$.
Generating the estimate requires the "setup cost" of $O(n)$ a.o.; after this cost is paid,
generating the estimate takes $O(1)[\ln(n) + m^2]$ a.o.

2. We associate with $x \in X$ the distribution $P_x$ on $X$ as follows. Given $x$, we compute
its eigenvalue decomposition $x = U\,\mathrm{Diag}\{\xi\}U^T$. The vector $\xi$ belongs to the polytope
$Q = \{\xi \in \mathbf{R}^m : 0 \le \xi_j \le 1,\ \sum_j \xi_j = k\}$. Now, there is a simple algorithm [7, section A.1]
which allows, given $\xi \in Q$, to represent $\xi$ as a convex combination $\sum_{i=1}^{m}\lambda_i\xi^i$ of extreme
points of $Q$ (which are Boolean vectors with exactly $k$ entries equal to 1); the cost of
building this representation is $O(1)km^2$ a.o. We build this representation and define $P_x$
as the distribution of a random symmetric matrix which takes the values $U\,\mathrm{Diag}\{\xi^i\}U^T$ with
probabilities $\lambda_i$, $1 \le i \le m$, so that the random estimate of $F_y(x, y)$ is the vector with the
entries $G_j = \sum_{\ell \in I_i}(a_j^T\mathrm{Col}_\ell[U])^2$, $1 \le j \le n$, where $I_i$ is the set of indexes of the $k$ nonzero
entries of the Boolean vector $\xi^i$, and $i$ takes the values $1, \dots, m$ with probabilities $\lambda_1, \dots, \lambda_m$.
Finally, we set $P_z = P_x \times P_y$. Note that this distribution is supported on $X \times Y$ (i.e.,
Assumption B is satisfied with $\rho = 0$). The "setup" cost of sampling from $P_x$ is $O(1)m^3$
a.o.; after this cost is paid, generating a sample value of the estimate of $F_y$ costs $O(1)kmn$ a.o.
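The two sampling rules can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the routine `decompose` is a simple greedy scheme written for this sketch and is not claimed to be the algorithm of [7, section A.1]; the names `estimate_Fx`, `estimate_Fy` and the small test instance are hypothetical; the data vectors $a_j$ are stored as columns of `a`.

```python
import numpy as np

def decompose(xi, k, tol=1e-12):
    """Greedy split of xi (0 <= xi_j <= 1, sum = k) into weighted k-sparse Boolean vectors.

    Returns (weights, supports) with xi ~ sum_i weights[i] * indicator(supports[i]).
    """
    r, s = np.array(xi, dtype=float), 1.0      # residual vector and residual mass
    weights, supports = [], []
    for _ in range(4 * len(r)):                # safety cap on the number of steps
        if s <= tol:
            break
        order = np.argsort(-r)
        top, rest = order[:k], order[k:]
        # largest step that keeps 0 <= r_j <= s for every j
        step = min(r[top].min(), s - (r[rest].max() if len(rest) else 0.0))
        if step <= 0.0:
            break
        r[top] -= step
        s -= step
        weights.append(step)
        supports.append(np.sort(top))
    return np.array(weights), supports

def estimate_Fx(a, y, rng):
    """One-sample estimate of F_x = -sum_j y_j a_j a_j^T: draw j with probability y_j."""
    j = rng.choice(len(y), p=y)
    return -np.outer(a[:, j], a[:, j])

def estimate_Fy(a, U, support):
    """Estimate of F_y for one extreme point: G_j = sum over l in support of (a_j^T U_l)^2."""
    return ((U[:, support].T @ a) ** 2).sum(axis=0)

rng = np.random.default_rng(0)
m, k, n = 6, 2, 5
a = rng.standard_normal((m, n))
a /= np.linalg.norm(a, axis=0)                 # unit columns a_j
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
x = Q @ np.diag([0.9, 0.5, 0.3, 0.2, 0.1, 0.0]) @ Q.T
x = (x + x.T) / 2                              # a point of X with Tr(x) = k
y = np.full(n, 1.0 / n)

lam, U = np.linalg.eigh(x)
w, supports = decompose(np.clip(lam, 0.0, 1.0), k)
i = rng.choice(len(w), p=w / w.sum())          # extreme point index, probability w_i
G_y = estimate_Fy(a, U, supports[i])
G_x = estimate_Fx(a, y, rng)
```

Averaging `estimate_Fy` over all extreme points with the weights `w` recovers the exact vector with entries $a_j^Txa_j$, which is the unbiasedness property used in the text.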

With the outlined randomization, the cost of generating a sample value of $G_{k_x,k_y}$ in the range
$\ln(n) \le O(1)m^2$ is

$O(1)\left(m^3 + k_xkmn + k_ym^2\right)$ a.o.

When $n \gg m \gg k$ and $k_x$, $k_y$ are moderate, this cost is by far less than the cost $O(1)m^2n$ of
deterministic computation of $F(x, y)$, so that our randomization indeed possesses some potential.
An analysis completely similar to the one in section 5.1 shows that our current situation is
completely similar to the one in that section: while with $k_x = O(1)$, $k_y = O(1)$, the iteration
count for the randomized algorithm is proportional to $\epsilon^{-2}$ instead of being proportional to
$\epsilon^{-1}$, as for the deterministic algorithm, the growth in this count, in a certain meaningful range of
values of $k, m, n, \epsilon$, is by far outweighed by the reduction in the cost of an iteration. As a result, for
$\epsilon$ fixed and in the case of an appropriate proportion between $k, m, n$, the randomized algorithm
progressively outperforms its deterministic competitor as the sizes of the problem grow.
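As a quick sanity check of these orders of magnitude, the per-iteration costs can be compared directly. The sketch below sets every $O(1)$ factor to 1 and adds the prox-mapping cost $m^3 + n$ to both sides (both are assumptions made for illustration); the function names are hypothetical.

```python
def det_iteration_cost(m, n):
    """Deterministic iteration: prox-mappings plus exact computation of F."""
    return (m**3 + n) + m**2 * n

def rand_iteration_cost(m, n, k, kx=1, ky=1):
    """Randomized iteration: prox-mappings plus one sample of the estimate of F."""
    return (m**3 + n) + (m**3 + kx * k * m * n + ky * m**2)

# Sizes used in the numerical illustration of this section
m, k, n = 100, 10, 300_000
ratio = rand_iteration_cost(m, n, k) / det_iteration_cost(m, n)
print(f"randomized / deterministic cost per iteration ~ {ratio:.3f}")
```

At these sizes the ratio is roughly $k/m = 0.1$, so a randomized iteration is about ten times cheaper in this crude count.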

Numerical illustration. In the experiment we are about to describe, the sizes of problem
(44) were selected as

$m = 100, \quad k = 10, \quad n = 300{,}000.$

The data points $a_j$ were selected at random in a certain "smart" way aimed at creating difficult
instances; we are not sure that this goal was indeed achieved, but at least the PCA solution
(which, with the straightforward random generation of the $a_j$, turns out to recover perfectly well
the approximating subspace) was "cut off": the largest, over all $j$, distance of the $a_j$'s to the
$k = 10$-dimensional PCA subspace in our experiments was as large as 0.99.

Implementation of the approach was completely similar to the one outlined in section 5.1;
the only specific issue which should be addressed here is the one of termination. Problem (44) by
its origin is no more than a relaxation of the "true" problem (43), so solving it within a given
accuracy is of not much interest. Instead, from time to time (namely, every 10 iterations) we took
the $x$-component of the current approximate solution, subjected it to eigenvalue decomposition
and checked straightforwardly what is the largest, over $j \le n$, $\|\cdot\|_2$-deviation $D$ of $a_j$ from the
$k$-dimensional subspace of $\mathbf{R}^m$ spanned by the $k$ principal eigenvectors of $x^t$. We terminated the
solution process when this distance was $\le \delta + \epsilon$, where $\epsilon$ is a prescribed tolerance.
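A minimal sketch of this termination check, assuming NumPy and data vectors stored as columns of `a` (the function name `max_deviation` and the tiny example are hypothetical):

```python
import numpy as np

def max_deviation(x, a, k):
    """Largest Euclidean distance from the columns of a to the span of the
    k principal eigenvectors of the symmetric matrix x."""
    _, U = np.linalg.eigh(x)            # eigenvalues come in ascending order
    V = U[:, -k:]                       # eigenvectors of the k largest eigenvalues
    residual = a - V @ (V.T @ a)        # components orthogonal to the subspace
    return np.linalg.norm(residual, axis=0).max()

# Tiny example: the subspace is spanned by the first two coordinate axes
x = np.diag([1.0, 1.0, 0.0, 0.0, 0.0])
a = np.array([[1.0, 0.6],
              [0.0, 0.0],
              [0.0, 0.8],
              [0.0, 0.0],
              [0.0, 0.0]])
D = max_deviation(x, a, k=2)
delta, eps = 0.8, 0.05
terminate = D <= delta + eps            # stopping rule D <= delta + eps
```

The check costs one eigenvalue decomposition plus $O(kmn)$ operations for the projections, which is why it is run only every few iterations.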

Typical experimental results are presented in Table 2. The results look surprisingly good:
the iteration count is quite low and is the same for both the deterministic and the randomized
algorithms. We do not know whether this unexpected phenomenon reflects the intrinsic simplicity
of the problem, or our inability to generate really difficult instances, or the fact that we worked
with reasonable, but not "really small", values of $\epsilon$; this being said, we again see that
randomization reduces the CPU time by a quite significant factor.

Method   # of steps   CPU, sec   Final deviation D

DMP      20           478        …
SMP      20           104        0.427

δ = 0.6, δ + ε = 0.65
DMP      20           …          0.603
SMP      20           105        0.620

δ = 0.8, δ + ε = 0.85
DMP      20           …          …
SMP      20           92         …

A Proofs



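A.1 Proof of Lemma 4.1

In what follows, $C_i$ are positive quantities depending solely on $d$, and $\bar Z$ is the convex hull of
$\{0\} \cup Z$. Observe that $L[\bar Z] = \mathrm{Lin}(\bar Z - \bar Z) \supset L[Z]$ and $\bar Z_* := \frac{1}{2}[\bar Z - \bar Z] \supset Z_*$; as a result,

$\|z\|_{\bar Z} \le \|z\| \quad \forall z \in L[Z].$ (47)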

1°. Observe that for some Ci one has 

y{z e Z,2 <k <d) : \Qk{z, z)\ < dV. (48) 
Indeed, let z & Z. The univariate polynomial 



p{t) ■.^^{tz)^J2Qk{z,...,z)t'' 



k=2 

on the segment < i < 1 is bounded in absolute value by V (since V is the variation of (f) on 
Z 3 and 0(0) = 0), so that the moduh \Qk{z, ■■■■,z)\ of its coefficients are bounded by CiV 
for some Ci depending solely on d. 

2°. Our next observation is that for some $C_2$ one has

$\forall (z \in L[\bar Z],\ 2 \le k \le d): \quad |Q_k(z, \dots, z)| \le C_2V\|z\|_{\bar Z}^k.$ (49)

Indeed, let $2 \le k \le d$. By homogeneity it suffices to verify (49) when $\|z\|_{\bar Z} = 1$, so that
$z = \frac{1}{2}[z^1 - z^2]$ with some $z^1, z^2 \in \bar Z$. Setting $h(t_1, t_2) = t_1z^1 + t_2z^2$, consider the polynomial of
two variables

$p(t_1, t_2) = Q_k(h(t_1, t_2), h(t_1, t_2), \dots, h(t_1, t_2)).$

$p$ is a polynomial of degree $\le k \le d$ on the 2D plane which is bounded in absolute value by $C_1V$
in the triangle $t_1, t_2 \ge 0$, $t_1 + t_2 \le 1$ (by (48) combined with the fact that for the outlined $t_1, t_2$
we have $h(t_1, t_2) = (1 - t_1 - t_2) \cdot 0 + t_1z^1 + t_2z^2 \in \bar Z$). As a result, the moduli of the coefficients
of $p$ do not exceed $C_3V$ with appropriately chosen $C_3$, whence $p(1/2, -1/2) = Q_k(z, \dots, z)$ is
bounded in absolute value by $C_2V$ with appropriately chosen $C_2$.

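3°. Now let $2 \le k \le d$, and let $z^1, \dots, z^k \in L[\bar Z]$, $\|z^i\|_{\bar Z} \le 1$, $1 \le i \le k$. Consider the
polynomial of $k$ real variables

$p(t_1, \dots, t_k) = Q_k\left(\sum_{i=1}^{k} t_iz^i, \sum_{i=1}^{k} t_iz^i, \dots, \sum_{i=1}^{k} t_iz^i\right).$

The degree of this polynomial does not exceed $k \le d$, and

$|p(t_1, \dots, t_k)| \le C_2V\|t_1z^1 + \dots + t_kz^k\|_{\bar Z}^k \le C_2V\|t\|_1^k$

by (49). It follows that for some $C_4$ we have

$\left|\frac{\partial^k p(t_1, \dots, t_k)}{\partial t_k\,\partial t_{k-1} \cdots \partial t_1}\right| \le C_4V.$

The left hand side in this relation is $k!\,|Q_k(z^1, \dots, z^k)|$ (recall that $Q_k(\cdot, \dots, \cdot)$ is $k$-linear and
symmetric), and we see that

$\forall (z^i \in L[\bar Z],\ \|z^i\|_{\bar Z} \le 1,\ 1 \le i \le k): \quad |Q_k(z^1, \dots, z^k)| \le \frac{C_4}{k!}V,$

which by homogeneity implies (26).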

4°. It remains to prove the "in particular" part of Lemma 4.1. Taking into account (23), (20)
- (22), to this end it suffices to verify that the second order directional derivative
$D^2\phi(z)[h, h] = \frac{d^2}{dt^2}\big|_{t=0}\phi(z + th)$ of $\phi$ taken at a point $z \in Z$ along a direction $h \in L[Z]$ satisfies

$|D^2\phi(z)[h, h]| \le \mathcal{L}\|h\|^2$

with $\mathcal{L}$ given by (27). This is immediate: by (2) we have

$D^2\phi(z)[h, h] = \sum_{k=2}^{d} k(k - 1)\,Q_k(h, h, z, \dots, z).$

We have \\z\\2 < 2 by definition of || • ||^ (recall that z E Z), so that by (26) the modulus 
of the right hand side docs not exceed Ylk=2^i^ ~ l)2'^~^||/i|||C^-^^V. It remains to note that 
\\h\\2 < \\h\\ due to he L[Z] and (47). □ 



A.2 Proof of Lemma 4.2

1°. Let, as always, Z be the convex hull of {0} U Z, and let us fix 2; e Z. Consider the random 
vectors (x, Cy taking values in E^, Ey, respectively: 

Cx Gx [z , Z ]■) Cy Gy [z , Z ]■, C [Cxj Cy] i 

^x^Cx- Fx{z), 5y^Cy- Fy{z), 5 = [5x; 5y], 

where 2;^, z^~^ are drawn, independently of each other, from P^. We claim that for some C5, 
depending solely on d, it holds 

\\S\U < C^Vil + pf-'. (50) 
Indeed, by construction of G[z^, z'^~^] and in view of (2) we have 

\/hGn7]-i = T,t=ikQkiDh,z\...,z>'-') 

^ ^'X {F{^),h) = Et=ikQk{Dh,z,...,z) 
^ = max Eti HQkiDh, z, z) - Q,{Dh, z\ z^-')] (51) 

n&L[Z\,\\n\\<.i. 

< max Et2^C'«V||m||^[||z|||-^ + ||^i^||z2||^...||z'=-l^] 

h:\\h\\<l ^ 

where the concluding inequality is due to (26) (take into account that h E L[Z] = L[X] x L\Y], 
whence Dh e L[Z] C L[Z], and that z,z^ e Afr(Z) C L\Z]). Invoking (47), we get \\Dh\\^ < 
\\Dh\\ = \\h\\. Besides this, z E Z implies that \\z\\2 < 2, while Assumption B combines with 
(47) and the relation < 2 for all e Z to imply that < 2(1 + p). In view of these 

observations, the concluding quantity in (51) is < ^^=2 ^^'^^^'^^'^"^[^ + (1 + p)''~^]j that 
< C5V(1 + p)'^~^ with C5 depending solely on d, as claimed in (50). □ 
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2°. We need the following fact: 

Proposition A.1 Let $F$ be a Euclidean space, $\|\cdot\|$ be a norm on $F$, $\|\cdot\|_*$ be the conjugate
norm, let $\Xi$ be a Polish space equipped with a Borel probability distribution, and $\mathcal{F}$ be the space
of all Borel mappings $f : \Xi \to F$ such that for some $c_f \in (0, \infty)$ it holds $\mathbf{E}\{\exp\{\|f(\cdot)\|_*^2/c_f^2\}\} \le
\exp\{1\}$. Then

(i) $\mathcal{F}$ is a linear space, and the quantity $\sigma[f] = \inf\{c > 0 : \mathbf{E}\{\exp\{\|f(\cdot)\|_*^2/c^2\}\} \le \exp\{1\}\}$
is a (semi)norm on $\mathcal{F}$;

(ii) Let $U$ be a convex compact set in $F$ such that $\frac{1}{2}[U - U]$ is the unit ball of the norm
$\|\cdot\|$. Assume that $U$ admits a d.-g.f. $\omega(\cdot)$ compatible with $\|\cdot\|$, and let $\Omega$ be the $\omega$-radius of
$U$. Then for a properly chosen absolute constant $O(1)$, with $\chi = O(1)\Omega$ the following holds true:

(!) Let $f_1, f_2, \dots$ be an $F$-valued martingale-difference, that is, a sequence of random
vectors taking values in $F$ and such that $\mathbf{E}_{|t-1}\{f_t\} = 0$ for all $t$, where $\mathbf{E}_{|t-1}$ is
the conditional expectation taken w.r.t. the $\sigma$-algebra spanned by $f_1, \dots, f_{t-1}$.
Assume that for a sequence of nonnegative deterministic reals $\sigma_1, \sigma_2, \dots$ it holds

$\mathbf{E}_{|t-1}\{\exp\{\|f_t\|_*^2/\sigma_t^2\}\} \le \exp\{1\}$ a.s.

Then for every $t$ one has

$\sigma[f_1 + \dots + f_t] \le \chi\sqrt{\sum_{\tau=1}^{t}\sigma_\tau^2}.$ (52)

Proof. (i) is well known; for the sake of completeness, here is the proof. The fact which indeed
needs verification is the triangle inequality. Thus, let $f, g \in \mathcal{F}$, $a > \sigma[f]$ and $b > \sigma[g]$; all we
need is to prove that $a + b \ge \sigma[f + g]$. Setting $\lambda = a/(a + b)$, we have

$\exp\{\|f + g\|_*^2/(a + b)^2\} \le \exp\{[\|f\|_* + \|g\|_*]^2/(a + b)^2\}$
$= \exp\{[\lambda(\|f\|_*/a) + (1 - \lambda)(\|g\|_*/b)]^2\} \le \lambda\exp\{(\|f\|_*/a)^2\} + (1 - \lambda)\exp\{(\|g\|_*/b)^2\},$

where the concluding $\le$ is due to the convexity of the univariate function $\exp\{s^2\}$. Taking
expectations in the resulting inequality, we get $\mathbf{E}\{\exp\{\|f + g\|_*^2/(a + b)^2\}\} \le \exp\{1\}$, that is,
$a + b \ge \sigma[f + g]$, as claimed. (i) is justified.

(ii): Let ip{u) = uj{u/2) : ^ H, and let /(^) = max^^l^[{^,u) — ^^{u)] be the Fenchel 
transform of ip. Since u is strongly convex, modulus 1, w.r.t. || ■ ||, is strongly convex, modulus 
1/4, w.r.t. II • II, whence, by the standard properties of the Fenchel transformation, / possesses 
Lipschitz continuous gradient, specifically, ||/'(0 ~ /'(^)ll ^ 4||^ — r]\\^ for all ^, rj. The Fenchel 
transform of the function '4>-{u) — '^{—u) is /_(0 — f{~0- V'* be the inf-convolution 

of and i.e., the function 

^%u) = inf^,^;^+^=„ {^Ij{v) + i'-{w)) = miy^u,':v-w'=u{^{v) + ^{w')) 

r eiu) := min,,^,,i^^,_,^jV(^) + i^W)], ueU^^'^[U-U] 
\ +00, u^U^ 

The Fenchel transform of the inf-convolution of $\psi$ and $\psi_-$ is the sum of the Fenchel transforms of
$\psi$ and $\psi_-$ (recall that the functions are convex with compact domains and are continuous
on their domains), that is, it is the function $g(\xi) = f(\xi) + f(-\xi)$. In particular, the Fenchel
transform $g$ of $\theta(\cdot)$ satisfies $\|g'(\xi) - g'(\eta)\| \le 8\|\xi - \eta\|_*$. By the standard properties of the Fenchel
transform, it follows that $\theta(\cdot)$ is strongly convex, modulus $1/8$, on its domain (which is exactly
the unit ball $U_*$ of the norm $\|\cdot\|$), and the variation (the maximum minus the minimum) of $\theta$ on
the domain is $\le \Omega^2$ (since the variation of $\psi(\cdot)$ over its domain, that is, the variation of $\omega(\cdot)$ over $U$, is $\Omega^2/2$).
The bottom line is that the unit ball $U_*$ of $\|\cdot\|$ admits a continuous strongly convex, modulus 1
w.r.t. $\|\cdot\|$, function (specifically, $8\theta(\cdot)$) with variation over $U_*$ not exceeding $8\Omega^2$. Invoking [5,
Proposition 3.3], it follows that the space $(F, \|\cdot\|_*)$ is $O(1)\Omega^2$-regular (for details, see [5]). With
this in mind, the conclusion (!) in (ii) is an immediate consequence of [5, Theorem 2.1.(ii)]. □



3^. Now we can complete the proof of Lemma 4.2. We have already seen that SO generates 
unbiased random estimates of F, whence SOk^^ky possesses the same property; thus, SOk^^ky 
meets the requirement (10.6), which is the first claim in Lemma 4.2. Now let us prove the 
second claim in this Lemma. In the notation from item 1*^, setting F = L[X] and denoting by 
TT the orthoprojector of E^. onto F C E^, (50) implies that 

h6^\\x,* = \Mx,* < C^Yil + pY^' (53) 

(since ||5||* = ||5a;||x,* + ||^j/||y,*)- The x-component of the "observation error" of SOk^^ky 
(the difference A = [A^.; A^^] of the random estimate of F{z) generated by SO^^^ky and F{z)) is 

A. = 5^/*^7rA, = 5]/;, (54) 

where /i, fk^ are independent copies of the zero mean random vector k~^n6x G F. Besides 
this, choosing a point x e X and setting X — X — x C F, — u{x + ^), ^ e X, we see that 
X^ — ^[X — X\ admits a d.-g-f., specifically, which is compatible with jj • jjx and is such 
that the w-radius of X is VLx- Invoking Proposition A.l.(ii) and taking into account that we 
are in the situation a[ft] = (j[ft] < C^k~^Y{l + pY~^ by (53), we get that for properly chosen 
Ce depending solely on d we have 

E{exp{||A,||^,,/?^}} <exp{l}, a, ^ C^^x^fil + pY'^ / Vk'. 

(note that ||A^||x,* = ||7rA^||x,*||). Besides this, by (53) \\Jt\\x,* < C^k-^Vil + pY'^ almost 
surely, whence 

E {exp{|| A.II^^Ja^}} < cxp{l}, a, = C,V{1 + p)'^-^ 
The bottom line is that with properly selected Cy depending solely on d and with 

a, = C7V(1 + pY'' min[l, Qx/VK] 

we have 

E{exp{||A,||^,,/<7^}}<exp{l}. 
By similar reasons, with properly selected Cg depending solely on d and with 

ay = CsV{l + pY'' niin[l, Qy/v^] 

we have 

E{exp{||A,||^>^}}<exp{l}. 

Taking into account that ||A||* = ||Aj;||x,* + jjAyjjy* and item (i) of Proposition A.l, the second 
claim in Lemma 4.2 follows. □ 
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A. 3 Proofs for section 5 



A. Let $\mathbf{S}^m$ be the space of $m \times m$ symmetric matrices equipped with the Frobenius inner
product; for $y \in \mathbf{S}^m$, let $\lambda(y)$ be the vector of eigenvalues of $y$ (taken with their multiplicities
in the non-ascending order). For an integer $k$, $1 \le k \le m$, let $Y^k = \{y \in \mathbf{S}^m : \|\lambda(y)\|_\infty \le
1,\ \|\lambda(y)\|_1 \le k\}$, so that $Y^k$ is the unit ball of a certain rotation-invariant norm $\|\cdot\|_{(k)}$ on $\mathbf{S}^m$.

Lemma A.1 Let $m, n, k$ be integers such that $m \ge n \ge k \ge 1$, and let $F$ be a linear subspace
in $\mathbf{S}^m$ such that every matrix $y \in F$ has at most $n$ nonzero eigenvalues. Let, further, $q \in (0, 1)$,
and let

$\chi(y) = \frac{1}{1+q}\sum_{j=1}^{m}|\lambda_j(y)|^{1+q} : \mathbf{S}^m \to \mathbf{R}.$

The function $\chi(\cdot)$ is continuously differentiable, convex, and its restriction on the set $Y_F =
\{y \in F : \|y\|_{(k)} \le 1\}$ is strongly convex w.r.t. $\|\cdot\|_{(k)}$, modulus

$\beta = q\min\left[1, \tfrac{1}{2}k^{1+q}n^{-q}\right].$ (55)

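Proof. 1°. Observe that

$\chi(y) = \mathrm{Tr}(f(y)), \quad f(s) = \frac{1}{1+q}|s|^{1+q}.$ (56)

Function $f(s)$ is continuously differentiable on the axis and twice continuously differentiable
outside of the origin; consequently, we can find a sequence of polynomials $f_r(s)$ converging,
as $r \to \infty$, to $f$ along with their first derivatives uniformly on every compact subset of $\mathbf{R}$
and, besides this, converging to $f$ uniformly along with the first and the second derivative on
every compact subset of $\mathbf{R}\setminus\{0\}$. Now let $y, h \in \mathbf{S}^m$, let $y = u\,\mathrm{Diag}\{\lambda\}u^T$ be the eigenvalue
decomposition of $y$, and let $h = u\hat{h}u^T$. For a polynomial $p(s) = \sum_{\ell=0}^{N}p_\ell s^\ell$, setting $P(w) =
\mathrm{Tr}(\sum_{\ell=0}^{N}p_\ell w^\ell) : \mathbf{S}^m \to \mathbf{R}$, and denoting by $\gamma$ a closed contour in $\mathbf{C}$ encircling the spectrum of
$y$, we have

(a) $P(y) = \mathrm{Tr}(p(y)) = \sum_{i=1}^{m}p(\lambda_i(y))$

(b) $DP(y)[h] = \sum_{\ell=1}^{N}\ell p_\ell\,\mathrm{Tr}(y^{\ell-1}h) = \mathrm{Tr}(p'(y)h) = \sum_{i=1}^{m}p'(\lambda_i(y))\hat{h}_{ii}$

(c) $D^2P(y)[h, h] = \frac{d}{dt}\big|_{t=0}DP(y + th)[h] = \frac{d}{dt}\big|_{t=0}\mathrm{Tr}(p'(y + th)h)$
$= \frac{d}{dt}\big|_{t=0}\frac{1}{2\pi i}\oint_\gamma\mathrm{Tr}(h(zI - (y + th))^{-1})p'(z)\,dz = \frac{1}{2\pi i}\oint_\gamma\mathrm{Tr}(h(zI - y)^{-1}h(zI - y)^{-1})p'(z)\,dz$
$= \frac{1}{2\pi i}\oint_\gamma\sum_{i,j=1}^{m}\hat{h}_{ij}^2\frac{p'(z)}{(z - \lambda_i(y))(z - \lambda_j(y))}\,dz = \sum_{i,j=1}^{m}\hat{h}_{ij}^2\,\Gamma_{ij}[p],$
$\Gamma_{ij}[p] = \begin{cases}\dfrac{p'(\lambda_i(y)) - p'(\lambda_j(y))}{\lambda_i(y) - \lambda_j(y)}, & \lambda_i(y) \ne \lambda_j(y)\\ p''(\lambda_i(y)), & \lambda_i(y) = \lambda_j(y)\end{cases}$

We conclude from (a, b) that as $r \to \infty$, the real-valued polynomials $F_r(\cdot) = \mathrm{Tr}(f_r(\cdot))$ on $\mathbf{S}^m$
converge, along with their first order derivatives, uniformly on every bounded subset of $\mathbf{S}^m$,
and the limit of the sequence, by (a), is exactly $\chi(\cdot)$. Thus, $\chi(\cdot)$ is continuously differentiable,
and (b) says that

$D\chi(y)[h] = \sum_{i=1}^{m}f'(\lambda_i(y))\hat{h}_{ii}.$ (57)

Besides this, (a-c) say that if $U$ is a closed convex set in $\mathbf{S}^m$ which does not contain singular
matrices, then $F_r(\cdot)$, as $r \to \infty$, converge along with the first and the second derivative
uniformly on every compact subset of $U$, so that $\chi(\cdot)$ is twice continuously differentiable on $U$,
and at every point $y \in U$ we have

$D^2\chi(y)[h, h] = \sum_{i,j=1}^{m}\hat{h}_{ij}^2\,\Gamma_{ij}, \quad \Gamma_{ij} = \begin{cases}\dfrac{f'(\lambda_i(y)) - f'(\lambda_j(y))}{\lambda_i(y) - \lambda_j(y)}, & \lambda_i(y) \ne \lambda_j(y)\\ f''(\lambda_i(y)), & \lambda_i(y) = \lambda_j(y)\end{cases}$ (58)

and in particular $\chi(\cdot)$ is convex on $U$.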

3°. We intend to prove that (i) $\chi(\cdot)$ is convex, and (ii) its restriction on the set $Y_F$ is strongly
convex, with a certain modulus $\alpha > 0$, w.r.t. the norm $\|\cdot\|_{(k)}$. Since $\chi$ is continuously
differentiable, all we need to prove (i) is to verify that

$\langle\chi'(y') - \chi'(y''), y' - y''\rangle \ge 0$ (*)

for a dense in $\mathbf{S}^m \times \mathbf{S}^m$ set of pairs $(y', y'')$, e.g., those with nonsingular $y' - y''$. For a pair
of the latter type, the polynomial $q(t) = \mathrm{Det}(y' + t(y'' - y'))$ of $t \in \mathbf{R}$ is not identically zero
and thus has finitely many roots on $[0, 1]$. In other words, we can find finitely many points
$t_0 = 0 < t_1 < \dots < t_n = 1$ such that all "matrix intervals" $\Delta_i = (y_i, y_{i+1})$, $y_k = y' + t_k(y'' - y')$,
$0 \le i \le n - 1$, are comprised of nonsingular matrices. Therefore $\chi$ is convex on every closed
segment contained in one of the $\Delta_i$'s, and since $\chi$ is continuously differentiable, (*) follows.
A^. It remains to prove that with (3 given by (55) one has 

(x'(y') - x'(/), - y") >^\y'- y"\\ W, y" e (59) 

Let e > 0, and let be a convex open in Y'^ = {y : \\y\\(k) ^ 1} neighborhood of Yp such that 
for all y e at most n eigenvalues of y are of magnitude > e. We intend to prove that for 
some ctg > one has 

{x\y') - x'{y"),y' - y") > c^elW - y%) "^y'.y" e y\ (60) 

Same as above, it suffices to verify this relation for a dense in x Y'^ set of pairs y' , y" G y^, 
e.g., for those pairs y\y" G Y^ for which y' — y" is nonsingular. Defining matrix intervals Aj 
as above and taking into account continuous differentiability of x, it suffices to verify that if 
y G Aj and h — y' — y", then D^x(y)[/i, h] > a^lhU. To this end observe that by (58) all we 
have to prove is that 

m 

D\{y)[h,h]=J2hl^^J>^e\\h\\ly (#) 

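5°. Setting $\lambda_j = \lambda_j(y)$, observe that $\lambda_i \ne 0$ for all $i$ due to the origin of $y$. We claim that if
$|\lambda_i| \ge |\lambda_j|$, then $\Gamma_{ij} \ge q|\lambda_i|^{q-1}$. Indeed, the latter relation definitely holds true when $\lambda_i = \lambda_j$.

Now, if $\lambda_i$ and $\lambda_j$ are of the same sign, then $\Gamma_{ij} = \frac{|\lambda_i|^q - |\lambda_j|^q}{|\lambda_i| - |\lambda_j|} \ge q|\lambda_i|^{q-1}$, since the derivative of
the concave (recall that $0 < q \le 1$) function $t^q$ of $t > 0$ is positive and nonincreasing. If $\lambda_i$ and
$\lambda_j$ are of different signs, then $\Gamma_{ij} = \frac{|\lambda_i|^q + |\lambda_j|^q}{|\lambda_i| + |\lambda_j|} \ge |\lambda_i|^{q-1}$ due to $|\lambda_j|^q \ge |\lambda_j||\lambda_i|^{q-1}$, and therefore

$\Gamma_{ij} \ge q|\lambda_i|^{q-1}$. Thus, our claim is justified.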
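The claim of this step is easy to probe numerically. The sketch below assumes $f(s) = |s|^{1+q}/(1+q)$, so that $f'(s) = \mathrm{sign}(s)|s|^q$ (this normalization is an assumption made here, chosen so that $f''(s) = q|s|^{q-1}$), and checks the inequality $\Gamma_{ij} \ge q\max(|\lambda_i|, |\lambda_j|)^{q-1}$ on a grid of nonzero values; the names `gamma` and `claim_holds` are hypothetical.

```python
def gamma(li, lj, q):
    """Divided difference of f'(s) = sign(s) * |s|**q at nonzero points li, lj."""
    def fprime(s):
        return (1.0 if s > 0 else -1.0) * abs(s) ** q
    if li == lj:
        return q * abs(li) ** (q - 1)          # f''(li)
    return (fprime(li) - fprime(lj)) / (li - lj)

def claim_holds(li, lj, q, tol=1e-9):
    """Check Gamma_ij >= q * max(|li|, |lj|)**(q - 1) up to round-off."""
    big = max(abs(li), abs(lj))
    return gamma(li, lj, q) >= q * big ** (q - 1) - tol

# Grid of nonzero eigenvalues of both signs, and several exponents q in (0, 1)
grid = [sign * v / 10 for sign in (-1, 1) for v in range(1, 21)]
ok = all(claim_holds(a, b, q) for q in (0.1, 0.5, 0.9) for a in grid for b in grid)
```

A grid check of this kind is of course no substitute for the concavity argument above; it only guards against sign or exponent slips in the formulas.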

W.l.o.g. we can assume that the positive reals $\mu_i = |\lambda_i|$, $i = 1, \dots, m$, form a nondecreasing
sequence, so that, by the above, $\Gamma_{ij} \ge q\mu_j^{q-1}$ when $i \le j$. Besides this, at most $n$ of the $\mu_j$ are $\ge \epsilon$,
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since y', y" e and therefore y e by convexity of F^. By the above, 

m 

i<j<m j=l 

or, equivalent ly by symmetry of /i, if 



hji hj2 



hij 
h2j 



and Hj is the Probenius norm ll/i-^ llpro of , then 



D\{y)[h,h]>qJ2H]f^r'- 

6°. Now note that 

m 

< /Xj < 1 Vj, fj,j < e, j < m — n, /Xj < k 
due to y e C and > for all j. Now, by the definition of || • ||(jt), setting 



(k) 



\H{k)], 



(61) 



(62) 



observe that either r] is the spectral norm ||A(/;,)||oo of h, or kr] is the nuclear norm of h. In 



the first case, the Frobenius norm of /i is > r], meaning that YlT=i 



Fro 



rj'^. Since 



q G (0, 1) and < fij < 1 for all j by (62), we conclude from (61) and from the evident relation 
Fro = '12 j 1 1 ^"'11 Pro = '12 j Hj that in the case in question we have 



D\{y)[h,h]>qJ2Hf>qv' = q 



2 

(fe)- 



Now assume that we are in the second case: 

/c||/i||(fc) = kr] 



\h\ 



\h\ 



(63) 



(64) 



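Observe that the $h^j$ are matrices of rank $\le 2$, so that $\|h^j\|_{\rm nuc} \le \sqrt{2}H_j$, and since $h = \sum_{j=1}^{m}h^j$, we
have $\|h\|_{\rm nuc} \le \sum_j\|h^j\|_{\rm nuc} \le \sqrt{2}\sum_jH_j$, which combines with (64) to imply the first inequality
in the following chain:

$k^2\|h\|_{(k)}^2 = \|h\|_{\rm nuc}^2 \le 2\left(\sum_{j=1}^{m}\left[H_j\mu_j^{(q-1)/2}\right]\mu_j^{(1-q)/2}\right)^2$
$\le 2\left(\sum_{j=1}^{m}H_j^2\mu_j^{q-1}\right)\left(\sum_{j=1}^{m}\mu_j^{1-q}\right)$ [Cauchy inequality]
$\le 2q^{-1}D^2\chi(y)[h, h]\sum_{j=1}^{m}\mu_j^{1-q}$ [by (61)]
$\le 2q^{-1}D^2\chi(y)[h, h]\left((m - n)\epsilon^{1-q} + \sum_{j=m-n+1}^{m}\mu_j^{1-q}\right)$ [by (62)]
$\le 2q^{-1}D^2\chi(y)[h, h]\left((m - n)\epsilon^{1-q} + k^{1-q}n^q\right)$ [by (62), since $0 < q \le 1$]

Thus, in the case of (64) we have

$D^2\chi(y)[h, h] \ge \frac{qk^2}{2\left[(m - n)\epsilon^{1-q} + k^{1-q}n^q\right]}\|h\|_{(k)}^2.$

Setting

$\alpha_\epsilon = q\min\left[1, \frac{k^2}{2\left[(m - n)\epsilon^{1-q} + k^{1-q}n^q\right]}\right]$ (65)

and recalling (63), we arrive at the desired inequality (#).

7°. As we have already explained, (#) implies the validity of (60) (and therefore the validity
of (59)) with $\alpha = \alpha_\epsilon$. Since $\alpha_\epsilon \to \beta$ as $\epsilon \to +0$, (59) indeed is satisfied with $\beta$ given by (55). □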



B. Lemma A.l is the key to the two statements as follows. 

Proposition A. 2 Let k,m be integers such that 1 < k < m/2, and let X — {x & S"* : x >z 
0, Tr(a;) = k}. The function 

r min[l,ln(A;)/ln(m/A;)], k>l ^ / ^' J" - ^ (66) 

\ 1/(2 ln(m)), k^l I ^^^2Vi), k^l 

is convex continuously differentiahle function on E which is strongly convex, modulus 1 w.r.t. 
II • ||x; on X and thus is a d.-g.f. for X compatible with || • ||x- The ui-radius of X satisfies 



Proof. The only non-evident statement is that cu is strongly convex, modulus 1 w.r.t. || • ||x, 
on X, and this is what we are about to prove. Let || • ||(fe) be the norm on S"* with the unit 
ball Y'^ = {yeS^: \\X{y)\U < 1, l|A(y)||i < k}, and let 



|1+<Z 



When k > \frn. Y contains the unit ball of the Frobenius norm, and consequently || ■ < 
II ■ IIfio, and q = 1, meaning that the function %(■) = ||| ■ ||Fro strongly convex, modulus 1, 
w.r.t. II • ||pro, and therefore is strongly convex, modulus (3 := 1, w.r.t. || ■ ||(fc) < || • Hfto- Let now 
k < \fm. In this case q G (0, 1), and therefore, by Lemma A.l, % is strongly convex, modulus 
^ := gminfl, |A;^+«m-«], on Y^ . Note that ^ = g/2 when A; > 1 and /3 = q/{'2y/e) when k ^ I. 

Now observe that $X$ clearly is contained in $Y$, implying that $\chi(x)$ is strongly convex, modulus $\beta$ w.r.t. $\|\cdot\|_{(k)}$, on $X$. At the same time, we claim that the $\|\cdot\|_X$-unit ball $\frac{1}{2}[X - X] \subset L[X] = \{x \in \mathbf{S}^m : \mathrm{Tr}(x) = 0\}$ contains the set $\{x \in L[X] : \|x\|_{(k)} \le 1/2\}$, meaning that $\|\cdot\|_X \le 2\|\cdot\|_{(k)}$ on $L[X]$; as a result, $\chi(\cdot)$ is strongly convex, modulus $\beta/4$ w.r.t. $\|\cdot\|_X$, on $X$, so that $\omega(x) = (4/\beta)\chi(x)$ is strongly convex, modulus 1 w.r.t. $\|\cdot\|_X$, on $X$, and this is exactly what we want to prove. To support our claim, let $x \in L[X]$ be such that $\|x\|_{(k)} \le 1/2$, and let $x = U\,\mathrm{Diag}\{\xi\}\,U^T$ be the eigenvalue decomposition of $x$. Since $x \in L[X]$ and $\|x\|_{(k)} \le 1/2$, we have

$$(a):\ \sum_{j=1}^m \xi_j = 0, \qquad (b):\ |\xi_j| \le 1/2\ \ \forall j \le m, \qquad (c):\ 2a := \sum_{j=1}^m |\xi_j| \le k/2.$$

Now let us select $\delta_j \ge 0$, $1 \le j \le m$, in such a way that

$$(d):\ |\xi_j| + \delta_j \le 1/2\ \ \forall j, \qquad (e):\ \sum_{j=1}^m \delta_j = k/2 - a.$$

Such a selection is possible due to $|\xi_j| \le 1/2$ (by (b)) and $\sum_{j=1}^m [1/2 - |\xi_j|] = m/2 - 2a \ge k/2 - a$ (see (c) and take into account that $k \le m/2$). Now let $\eta^+ = 2(\xi^+ + \delta)$, $\eta^- = 2(\xi^- + \delta)$, where $\xi^+$ is the vector with coordinates $\max[\xi_j, 0]$, and $\xi^-$ is the vector with coordinates $\max[-\xi_j, 0]$. We have $\eta^\pm \ge 0$ (since $\delta \ge 0$) and $\|\eta^\pm\|_\infty \le 1$ (by (d)). Finally, $\sum_j \xi_j^+ = \sum_j \xi_j^- = a$ by (a) and by the definition of $a$, whence $\sum_j \eta_j^+ = \sum_j \eta_j^- = 2\sum_j \delta_j + 2a = k$ by (e). These relations imply that the symmetric matrices $x^\pm = U\,\mathrm{Diag}\{\eta^\pm\}\,U^T$ belong to $X$, and by construction $x = \frac{1}{2}[x^+ - x^-]$, so that $x \in \frac{1}{2}[X - X]$, as claimed. □
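The construction in the last part of the proof is explicit enough to be run numerically. The sketch below is only an illustration: the names `split_eigenvalues` and `split_matrix` are hypothetical, the greedy choice of $\delta$ is one of many selections satisfying (d) and (e), and the input is assumed to satisfy (a)-(c) together with $k \le m/2$.

```python
import numpy as np

def split_eigenvalues(xi, k):
    """Return (eta_plus, eta_minus) with xi = (eta_plus - eta_minus) / 2.

    Assumes conditions (a)-(c): sum(xi) = 0, |xi_j| <= 1/2, sum|xi_j| <= k/2,
    and k <= len(xi) / 2. Hypothetical helper, not taken from the paper.
    """
    xi = np.asarray(xi, dtype=float)
    a = 0.5 * np.abs(xi).sum()                 # 2a = sum_j |xi_j|
    room = np.maximum(0.5 - np.abs(xi), 0.0)   # (d): delta_j <= 1/2 - |xi_j|
    remaining = k / 2.0 - a                    # (e): sum_j delta_j = k/2 - a
    delta = np.zeros_like(xi)
    for j in range(xi.size):                   # greedy fill, one valid choice
        delta[j] = min(room[j], max(remaining, 0.0))
        remaining -= delta[j]
    eta_plus = 2.0 * (np.maximum(xi, 0.0) + delta)
    eta_minus = 2.0 * (np.maximum(-xi, 0.0) + delta)
    return eta_plus, eta_minus

def split_matrix(x, k):
    """Matrix version: x = (x_plus - x_minus) / 2, x_pm = U Diag(eta_pm) U^T."""
    xi, U = np.linalg.eigh(x)
    eta_plus, eta_minus = split_eigenvalues(xi, k)
    return (U * eta_plus) @ U.T, (U * eta_minus) @ U.T

# Example with m = 8, k = 2 and eigenvalues satisfying (a)-(c).
xi = np.array([0.5, -0.25, -0.25, 0.0, 0.0, 0.0, 0.0, 0.0])
eta_plus, eta_minus = split_eigenvalues(xi, k=2)
```

In this example both `eta_plus` and `eta_minus` are nonnegative, bounded by 1, and sum to $k = 2$, which are the properties the proof uses to place $x^\pm$ in $X$.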



Proposition A.3 Let $K, M, N$ be integers such that $1 \le K \le M \le N$, and let $\|\cdot\|_{(K)}$ be the norm on $\mathbf{R}^{M \times N}$ with the unit ball $X = \{x \in \mathbf{R}^{M \times N} : \|\sigma(x)\|_\infty \le 1,\ \|\sigma(x)\|_1 \le K\}$. Then the
function

A ^ 

^(^) = ^^—^y,'^]^\x), q^^hi[lM{2K)/\n{M/K)l (68) 

is convex and continuously differentiable, and its restriction on $X$ is strongly convex, modulus
1 w.r.t. $\|\cdot\|_{(K)}$, on $X$. The $\omega$-radius $\Omega_X$ of $X$ satisfies



""^^'^^wh- '''' 

Proof. The only nontrivial claim is that $\omega(\cdot)$ is strongly convex, modulus 1, w.r.t. $\|\cdot\|_{(K)}$. When $q = 1$, i.e., when $\sqrt{2}K \ge \sqrt{M}$, $X$ clearly contains the ball $\{x : \|x\|_{\mathrm{Fro}} \le 1/\sqrt{2}\}$, whence $\|\cdot\|_{(K)} \le \sqrt{2}\|\cdot\|_{\mathrm{Fro}}$, and $\omega(x) = \|x\|_{\mathrm{Fro}}^2$ is strongly convex, modulus 2, w.r.t. $\|\cdot\|_{\mathrm{Fro}}$, and thus indeed strongly convex, modulus 1, w.r.t. $\|\cdot\|_{(K)}$. Now let $q < 1$. Let $m = M + N$, $n = 2M$, $k = 2K$, so that $1 < k < m$, and let $A(x) = \begin{bmatrix} & x \\ x^T & \end{bmatrix}$ be the linear mapping from $\mathbf{R}^{M \times N}$ to $\mathbf{S}^m$. It is well known that the eigenvalues of $A(x)$ are the $n = 2M$ reals $\pm\sigma_i(x)$, $1 \le i \le M$, and $m - n$ zeros. Therefore for the norm $\|\cdot\|_{(k)}$ specified in Lemma A.1 it holds

$$\|x\|_{(K)} = \|A(x)\|_{(k)} \quad \forall x \in \mathbf{R}^{M \times N}. \qquad (70)$$

By Lemma A.1, the function $\widehat{\omega}(y) = \frac{1}{\beta(1+q)} \sum_{j=1}^m |\lambda_j(y)|^{1+q}$ is convex and continuously differentiable on the entire $\mathbf{S}^m$, and its restriction on the set $Y = \{y \in \mathrm{Im}(A) : \|y\|_{(k)} \le 1\}$ is strongly convex, modulus 1 w.r.t. $\|\cdot\|_{(k)}$, on $Y$, implying, due to (70), that the function $\omega(x) = \widehat{\omega}(A(x))$ is convex and continuously differentiable on $\mathbf{R}^{M \times N}$, and its restriction on the unit ball $X$ of the norm $\|\cdot\|_{(K)}$ is strongly convex, modulus 1 w.r.t. $\|\cdot\|_{(K)}$, on $X$. □
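The spectral fact behind (70) is easy to confirm numerically. The following sketch assumes that the mapping in the proof is the standard symmetric embedding of an $M \times N$ matrix into $\mathbf{S}^{M+N}$ (zero diagonal blocks, $x$ and $x^T$ off the diagonal), and that both norms are the gauges of the unit balls described in the text. The function names are hypothetical.

```python
import numpy as np

def symmetric_embedding(x):
    """Assumed form of A(x): [[0, x], [x^T, 0]] in S^(M+N)."""
    M, N = x.shape
    top = np.hstack([np.zeros((M, M)), x])
    bottom = np.hstack([x.T, np.zeros((N, N))])
    return np.vstack([top, bottom])

def norm_K(x, K):
    """Gauge of {x : ||sigma(x)||_inf <= 1, ||sigma(x)||_1 <= K}."""
    s = np.linalg.svd(x, compute_uv=False)
    return max(s.max(), s.sum() / K)

def norm_k(y, k):
    """Gauge of {y : ||lambda(y)||_inf <= 1, ||lambda(y)||_1 <= k}."""
    lam = np.abs(np.linalg.eigvalsh(y))
    return max(lam.max(), lam.sum() / k)

# Example: M = 3, N = 5, K = 2, so m = 8, n = 6, k = 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5))
y = symmetric_embedding(x)
sigma = np.linalg.svd(x, compute_uv=False)
# Expected spectrum: +sigma_i, -sigma_i, and m - n = N - M zeros.
expected = np.sort(np.concatenate([sigma, -sigma, np.zeros(5 - 3)]))
spectrum = np.linalg.eigvalsh(y)
```

Comparing `spectrum` with `expected`, and `norm_K(x, 2)` with `norm_k(y, 4)`, reproduces the identity (70) for this sample matrix.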
Remark A.1 Note that, inspecting the proofs, it is easily seen that the results of Propositions A.2, A.3 remain true when one replaces $\mathbf{S}^m$ (resp., $\mathbf{R}^{M \times N}$) with their subspaces comprised of block-diagonal matrices of a given block-diagonal structure. E.g., when $1 \le k \le m/2$, the function

$$\omega(x) = \frac{4}{\beta(1+q)} \sum_{j=1}^m |x_j|^{1+q}$$

with $q$, $\beta$ given by (66) is a d.-g.f. for the set $X = \{x \in \mathbf{R}^m : 0 \le x_j \le 1\ \forall j,\ \sum_{j=1}^m x_j = k\}$ compatible with the norm $\|\cdot\|_X$ with the unit ball $\frac{1}{2}[X - X]$ on the space $L[X] = \mathrm{Lin}(X - X) = \{x \in \mathbf{R}^m : \sum_{j=1}^m x_j = 0\}$ (treat $m$-dimensional vectors as diagonals of $m \times m$ diagonal matrices).
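For the vector case of the remark, the strong convexity claim can be probed on random points. The sketch below is a numerical spot check under stated assumptions, not a proof: it takes $\omega(x) = \frac{4}{\beta(1+q)}\sum_j |x_j|^{1+q}$ with $q$ as in (66) and $\beta = q\min[1, \frac{1}{2}k^{1+q}m^{-q}]$ (with $\beta = 1$ when $q = 1$, as in the proof of Proposition A.2), and it uses the bound $\|\cdot\|_X \le 2\|\cdot\|_{(k)}$ from that proof. All function names are hypothetical.

```python
import math
import numpy as np

def q_and_beta(k, m):
    """q from (66); beta as read from the proof of Proposition A.2."""
    if k == 1:
        q = 1.0 / (2.0 * math.log(m))
    else:
        q = min(1.0, math.log(k) / math.log(m / k))
    if q >= 1.0:                       # case k >= sqrt(m): the proof takes beta = 1
        return 1.0, 1.0
    return q, q * min(1.0, 0.5 * k ** (1 + q) * m ** (-q))

def omega_grad(x, q, beta):
    """Gradient of omega(x) = 4 / (beta (1 + q)) * sum_j |x_j|^(1 + q)."""
    return (4.0 / beta) * np.sign(x) * np.abs(x) ** q

def norm_k(z, k):
    """Gauge of {z : ||z||_inf <= 1, ||z||_1 <= k}."""
    return max(np.abs(z).max(), np.abs(z).sum() / k)

def random_point(rng, k, m, n_vertices=5):
    """Average of random 0/1 vectors with exactly k ones; lies in X."""
    x = np.zeros(m)
    for _ in range(n_vertices):
        v = np.zeros(m)
        v[rng.choice(m, size=k, replace=False)] = 1.0
        x += v
    return x / n_vertices

def min_convexity_ratio(k, m, trials=200, seed=0):
    """Smallest observed <grad(x) - grad(y), x - y> / (2 ||x - y||_(k))^2."""
    rng = np.random.default_rng(seed)
    q, beta = q_and_beta(k, m)
    worst = math.inf
    for _ in range(trials):
        x, y = random_point(rng, k, m), random_point(rng, k, m)
        d = x - y
        nk = norm_k(d, k)
        if nk == 0.0:
            continue
        gap = float(np.dot(omega_grad(x, q, beta) - omega_grad(y, q, beta), d))
        worst = min(worst, gap / (4.0 * nk ** 2))
    return worst
```

A ratio of at least 1 on every sampled pair is what modulus-1 strong convexity w.r.t. $2\|\cdot\|_{(k)}$, and hence w.r.t. $\|\cdot\|_X$, requires; for instance, `min_convexity_ratio(3, 20)` and `min_convexity_ratio(1, 10)` can be inspected.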